diff --git a/README.md b/README.md index b71c458..5f74693 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,160 @@ -CUDA Stream Compaction +Project 2: CUDA Stream Compaction ====================== -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +**Course project for CIS 565: GPU Programming and Architecture, University of Pennsylvania** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Name: Meghana Seshadri +* Tested on: Windows 10, i7-4870HQ @ 2.50GHz 16GB, GeForce GT 750M 2048MB (personal computer) -### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Project Overview +The goal of this project was to get an introduction to the GPU stream compaction in CUDA. This algorithm is often used to quickly process large sets of data according to certain conditions - hence very useful in parallel programming. In this project, the stream compaction will remove any instance of a zero from an array of int's. + + +The following four versions of stream compaction were implemented: + +* CPU algorithm +* Naive parallel algorithm +* Work-efficient algorithm +* Thrust-based algorithm + + +The stream compaction algorithm breaks down into a scan function and a compat function. The scan function computes an exclusive prefix sum. The compact function maps an input array to an array of zero's and one's, scans it, and then uses scatter to produce the output. + +The following descriptions dive a bit deeper into the different approaches: + +### CPU +* `n - 1` adds for an array of length `n` --> O(n) number of adds + + +### Naive Parallel +* log2(n) passes on n elements --> O(n log2 (n)) numer of adds + + +### Work-Efficient +* This algorithm organizes the data into a balanced binary tree +* The occurs in two phases: + - Up-sweep = implementing parallel reduction + - Down-sweep = traversing back down the tree using partial sums to build the scan in place +* Up Sweep = O(n - 1) adds +* Down Sweep = O(n - 1) adds, O(n - 1) swaps + + +### Thrust-Based +* The Thrust library has a "thrust::exclusive_scan()" function that we use to compare the above algorithms to. + + +## Performance Analysis + +### Block Comparison + +Below is a graph that shows which would be the best block size to test on. The times below are for the work-efficient algorithm, comparing 64 elements and 65536 elements. + +![](images/blockcomparison.PNG) + +Based on the chart above, block size 128 on average, seems to be the most efficient. The rest of the analysis below will be using that block size. + + +### Algorithm Comparison + +Below are charts showing tests across all 4 algorithms, both testing with power-of-two sized arrays and non-power-of-two sized arrays. + + +![](images/algorithmcomparison.PNG) + + +![](images/algorithmcomparisontable.PNG) + + +First things first - why is the GPU implementation so much slower than the Naive Parallel and CPU versions? Isn't the GPU supposed to be faster? + +Take a look at the example diagram of the up sweep process as a part of the work-efficient algorithm below. + +![](images/upsweep.PNG) + +After iteration 0, only 4 out of the 8 elements are being computed and written to the output array. After iteration 1, only two are being computed and written, and after iteration 2, only one final answer is being outputted. + +Within each iteration, the same number of blocks are being launched. Essentially, every index is being allocated a block of memory (aka 32 threads). However, only one thread in 4 blocks for iteration 0, 2 blocks for iteration 1, and 1 block for iteration 2 are actually being used. This means that at least 4 blocks are not being used at all per iteration, and most of the threads per block are not even doing any work. So, really, the actual power of the GPU isn't being used at all. + +In order to optimize the code, it would be better to try and allocate only the necessary blocks needed per iteration, and to try and push computation to be done on earlier threads to achieve early termination of the block. In this example, this means that at iteration 0: + +* Having the computation and write at index 1 use thread #0 instead of thread #1 +* Having the computation and write at index 3 use thread #1 instead of thread #3 +* Having the computation and write at index 5 use thread #2 instead of thread #5 +* Having the computation and write at index 7 use thread #3 instead of thread #7 + +This way you only use the number of threads required and push them to be the first couple of ones so that the block can retire early and be used elsewhere. + + +Despite this explanation, it can be seen that Thrust implementation, up until size power of 14, becomes more efficient with increasing array size. It's interesting to note that there's a drastic increase in computation time across all algorithms with array size power of 14 and above. + + +### Scan and Stream Compaction Test Results + +The following tests were run on 128 elements at a block size of 128. + +``` +**************** +** SCAN TESTS ** +**************** + [ 40 12 3 21 27 2 32 20 1 4 41 34 15 ... 34 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.000821ms (std::chrono Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6112 6146 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.000821ms (std::chrono Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6063 6064 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.44064ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6112 6146 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.433024ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6063 6064 ] + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.781184ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6112 6146 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 1.20867ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6099 6099 ] + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6063 6064 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.34442ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6112 6146 ] + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.019392ms (CUDA Measured) + [ 0 40 52 55 76 103 105 137 157 158 162 203 237 ... 6063 6064 ] + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 0 1 3 1 0 0 2 1 0 1 0 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.000821ms (std::chrono Measured) + [ 2 1 3 1 2 1 1 1 3 3 3 3 3 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.000821ms (std::chrono Measured) + [ 2 1 3 1 2 1 1 1 3 3 3 3 3 ... 3 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.962322ms (std::chrono Measured) + [ 2 1 3 1 2 1 1 1 3 3 3 3 3 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2.22262ms (CUDA Measured) + [ 2 1 3 1 2 1 1 1 3 3 3 3 3 ... 1 1 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 2.10656ms (CUDA Measured) + [ 2 1 3 1 2 1 1 1 3 3 3 3 3 ... 3 1 ] + passed +Press any key to continue . . . +``` \ No newline at end of file diff --git a/images/algorithmcomparison.PNG b/images/algorithmcomparison.PNG new file mode 100644 index 0000000..1b0fd00 Binary files /dev/null and b/images/algorithmcomparison.PNG differ diff --git a/images/algorithmcomparisontable.PNG b/images/algorithmcomparisontable.PNG new file mode 100644 index 0000000..8b330f0 Binary files /dev/null and b/images/algorithmcomparisontable.PNG differ diff --git a/images/blockcomparison.PNG b/images/blockcomparison.PNG new file mode 100644 index 0000000..c67cd28 Binary files /dev/null and b/images/blockcomparison.PNG differ diff --git a/images/kernelindexmanipulationtests.PNG b/images/kernelindexmanipulationtests.PNG new file mode 100644 index 0000000..1d647e5 Binary files /dev/null and b/images/kernelindexmanipulationtests.PNG differ diff --git a/images/upsweep.PNG b/images/upsweep.PNG new file mode 100644 index 0000000..31efd84 Binary files /dev/null and b/images/upsweep.PNG differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..1c53b89 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,9 +13,28 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array + +const int SIZE = 1 << 22; +//const int SIZE = 1 << 16; +//const int SIZE = 1 << 14; +//const int SIZE = 1 << 12; +//const int SIZE = 1 << 10; +//const int SIZE = 1 << 8; +//const int SIZE = 1 << 6; + +//const int SIZE = 1 << 8; // = 256, feel free to change the size of array +//const int SIZE = 1 << 5; // = 32, feel free to change the size of array +//const int SIZE = 1 << 4; // = 16, feel free to change the size of array +//const int SIZE = 1 << 3; // = 8, feel free to change the size of array +//const int SIZE = 1 << 2; // = 4, feel free to change the size of array + const int NPOT = SIZE - 3; // Non-Power-Of-Two -int a[SIZE], b[SIZE], c[SIZE]; +//int a[SIZE], b[SIZE], c[SIZE]; + +int *a = new int[SIZE]; +int *b = new int[SIZE]; +int *c = new int[SIZE]; + int main(int argc, char* argv[]) { // Scan tests @@ -49,42 +68,44 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + + printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -129,15 +150,25 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit + + //Copying this b/c aparently it stops from my exe window from crashing on the first go + //Not sure why + system("pause"); // stop Win32 console from closing on exit + + + free(a); + free(b); + free(c); } + diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..db96609 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,5 +1,6 @@ #include "common.h" + void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); if (cudaSuccess == err) { @@ -23,7 +24,22 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + + // TODO + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < n) + { + if (idata[index] == 0) + { + bools[index] = 0; + } + else + { + bools[index] = 1; + } + }//end kernMapToBoolean function } /** @@ -33,7 +49,17 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO - } + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < n) + { + if (bools[index] == 1) + { + odata[indices[index]] = idata[index]; + } + } + }//end kernScatter function } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 55f1b38..485df32 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,18 +1,21 @@ #pragma once -#include -#include - -#include -#include -#include -#include +#include +#include + +#include +#include +#include +#include #include -#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +//Block size used for CUDA kernel launch +#define blockSize 128 //Change this for performance analysis + /** * Check for CUDA errors; print and exit if there was a problem. */ @@ -37,96 +40,96 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds = 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; + /** + * This class is used for timing the performance + * Uncopyable and unmovable + * + * Adapted from WindyDarian(https://github.com/WindyDarian) + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; }; } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..d061c57 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,50 +1,147 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" + +/* + * This stream compaction method will remove 0s from an array of ints. +*/ namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** - * CPU scan (prefix sum). + * CPU scan (prefix sum). Compute an EXCLUSIVE prefix sum * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } + // TODO + + bool timerHasStartedElsewhere = false; + try + { + //If this fails (b/c it was called elsewhere), it'll go into the catch + timer().startCpuTimer(); + } + catch (std::runtime_error &e) + { + timerHasStartedElsewhere = true; + } + + odata[0] = 0; + for (int k = 1; k < n; k++) + { + odata[k] = odata[k - 1] + idata[k - 1]; + } + + //Only want to call endTimer if startTimer hasn't been called elsewhere + if (!timerHasStartedElsewhere) + { + timer().endCpuTimer(); + } + + }//end scan function /** * CPU stream compaction without using the scan function. * * @returns the number of elements remaining after compaction. + + * Notes: + Compute temp array containing 0's and 1's whether element meets criteria + */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + + // TODO + + //Fill odata with elements from idata that aren't 0 (if any, this array's size will be less than idata) + int counter = 0; + + for (int k = 0; k < n; k++) + { + if (idata[k] != 0) + { + odata[counter] = idata[k]; + counter++; + } + } + + timer().endCpuTimer(); + + return counter; } /** * CPU stream compaction using scan and scatter, like the parallel version. * * @returns the number of elements remaining after compaction. + + * Notes: + Map the input array to an array of 0s and 1s, scan it, + and use scatter to produce the output. + You will need a CPU scatter implementation for this + (see slides or GPU Gems chapter for an explanation). */ int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + // TODO + //Map input array to temp array of 0's and 1's + //int* itemp = new int[n]; + int* itemp = (int *)malloc(sizeof(int) * n); + int counter = 0; + + for (int k = 0; k < n; k++) + { + if (idata[k] != 0) + { + itemp[k] = 1; + counter++; + } + else + { + itemp[k] = 0; + } + } + + //Scan temp array + //int* otemp = new int[n]; + int* otemp = (int *)malloc(sizeof(int) * n); + scan(n, otemp, itemp); + + + //Use scatter to produce output + //Scan array values act as destination indices in odata + int counter2 = 0; + for (int k = 0; k < n; k++) + { + if (itemp[k] == 1) + { + int otemp_index = otemp[k]; + int idata_value = idata[k]; + + odata[otemp_index] = idata_value; + counter2++; + } + } + timer().endCpuTimer(); - return -1; + + //delete[] itemp; + //delete[] otemp; + free(itemp); + free(otemp); + + return counter; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..3987969 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,21 +5,156 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + __global__ void upSweep(int n, int factorPlusOne, int factor, int addTimes, int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < addTimes) + { + int newIndex = (factorPlusOne * (index + 1)) - 1; + + if (newIndex < n) + { + idata[newIndex] += idata[newIndex - factor]; + + //if (newIndex == n - 1) + //{ + // idata[newIndex] = 0; + //} + } + } + + + + }//end upSweep function + + __global__ void downSweep(int n, int factorPlusOne, int factor, int addTimes, int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < addTimes) + { + int newIndex = (factorPlusOne * (index + 1)) - 1; + + if (newIndex < n) + { + int leftChild = idata[newIndex - factor]; + idata[newIndex - factor] = idata[newIndex]; + idata[newIndex] += leftChild; + } + } + + }//end downSweep function + + + __global__ void resizeArray(int n, int new_n, int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < new_n && index >= n) + { + idata[index] = 0; + } + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. + + * Notes: + Most of the text in Part 2 applies. + This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. + This can be done in place - it doesn't suffer from the race conditions of the naive method, + since there won't be a case where one thread writes to and another thread reads from the same location in the array. + Beware of errors in Example 39-2. Test non-power-of-two-sized arrays. + Since the work-efficient scan operates on a binary tree structure, it works best with arrays with power-of-two length. + Make sure your implementation works on non-power-of-two sized arrays (see ilog2ceil). + This requires extra memory, so your intermediate array sizes + will need to be rounded to the next power of two. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } + // TODO + + //If non-power-of-two sized array, round to next power of two + int new_n = 1 << ilog2ceil(n); + + dim3 fullBlocksPerGrid((new_n + blockSize - 1) / blockSize); + + int *inArray; + cudaMalloc((void**)&inArray, new_n * sizeof(int)); + checkCUDAError("cudaMalloc inArray failed!"); + + //Copy input data to device array and resize if necessary + cudaMemcpy(inArray, idata, sizeof(int) * new_n, cudaMemcpyHostToDevice); + resizeArray<<>>(n, new_n, inArray); + + bool timerHasStartedElsewhere = false; + try + { + timer().startGpuTimer(); + } + catch (std::runtime_error &e) + { + timerHasStartedElsewhere = true; + } + + dim3 newNumBlocks = fullBlocksPerGrid; + + + //Up sweep + for (int d = 0; d <= ilog2ceil(n) - 1; d++) + { + int factorPlusOne = 1 << (d + 1); //2^(d + 1) + int factor = 1 << d; //2^d + + int addTimes = 1 << (ilog2ceil(n) - 1 - d); + + newNumBlocks = ((new_n / factorPlusOne) + blockSize - 1) / blockSize; + + upSweep<<>>(new_n, factorPlusOne, factor, addTimes, inArray); + + //Make sure the GPU finishes before the next iteration of the loop + cudaThreadSynchronize(); + + + } + + //Down sweep + int lastElem = 0; + cudaMemcpy(inArray + (new_n - 1), &lastElem, sizeof(int) * 1, cudaMemcpyHostToDevice); + + for (int d = ilog2ceil(n) - 1; d >= 0; d--) + { + int factorPlusOne = 1 << (d + 1); //2^(d + 1) + int factor = 1 << d; //2^d + + int addTimes = 1 << (ilog2ceil(n) - 1 - d); + + newNumBlocks = ((new_n / factor) + blockSize - 1) / blockSize; + + downSweep<<>>(new_n, factorPlusOne, factor, addTimes, inArray); + + + cudaThreadSynchronize(); + } + + if (!timerHasStartedElsewhere) + { + timer().endGpuTimer(); + } + + //Transfer to odata + cudaMemcpy(odata, inArray, sizeof(int) * (new_n), cudaMemcpyDeviceToHost); + + //Free the arrays + cudaFree(inArray); + }//end scan function /** * Performs stream compaction on idata, storing the result into odata. @@ -31,10 +166,78 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; - } - } -} + // TODO + + int new_n = 1 << ilog2ceil(n); + + dim3 fullBlocksPerGrid((new_n + blockSize - 1) / blockSize); + + int *inArray; + int *boolsArray; + cudaMalloc((void**)&inArray, new_n * sizeof(int)); + checkCUDAError("cudaMalloc inArray failed!"); + cudaMalloc((void**)&boolsArray, new_n * sizeof(int)); + checkCUDAError("cudaMalloc boolsArray failed!"); + + int* scan_in = (int *)malloc(sizeof(int) * new_n); + int* scan_out = (int *)malloc(sizeof(int) * new_n); + + int *scatter_in; + int *scatter_out; + cudaMalloc((void**)&scatter_in, new_n * sizeof(int)); + checkCUDAError("cudaMalloc scatter_in failed!"); + cudaMalloc((void**)&scatter_out, new_n * sizeof(int)); + checkCUDAError("cudaMalloc scatter_out failed!"); + + cudaThreadSynchronize(); + + + //Copy input data to device array + cudaMemcpy(inArray, idata, sizeof(int) * new_n, cudaMemcpyHostToDevice); + resizeArray<<>>(n, new_n, inArray); + + timer().startGpuTimer(); + + //Call kernMapToBoolean to map values to bool array + Common::kernMapToBoolean<<>>(new_n, boolsArray, inArray); + + //Copy back to host array, find how many fulfilled condition, and run exclusive scan + cudaMemcpy(scan_in, boolsArray, sizeof(int) * new_n, cudaMemcpyDeviceToHost); + + int numPassedElements = 0; + for (int i = 0; i < new_n; i++) + { + if (scan_in[i] == 1) + { + numPassedElements++; + } + } + + scan(new_n, scan_out, scan_in); + + //Copy output of CPU scan to scatter device array + cudaMemcpy(scatter_in, scan_out, sizeof(int) * new_n, cudaMemcpyHostToDevice); + + //Call kernScatter with scanned boolsArray + Common::kernScatter<<>>(new_n, scatter_out, inArray, boolsArray, scatter_in); + + timer().endGpuTimer(); + + //SCATTER OUT ISNT GONNA BE THE SAME SIZE AS N + //Should I replace n with numPassedElements? + cudaMemcpy(odata, scatter_out, sizeof(int) * numPassedElements, cudaMemcpyDeviceToHost); + + //Free the arrays + free(scan_in); + free(scan_out); + cudaFree(inArray); + cudaFree(boolsArray); + cudaFree(scatter_in); + cudaFree(scatter_out); + checkCUDAError("cudaFree failed!"); + + return numPassedElements; + + }//end compact function + }//end namespace Efficient +}//end namespace StreamCompaction diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..7d9be73 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -6,6 +6,12 @@ namespace StreamCompaction { namespace Efficient { StreamCompaction::Common::PerformanceTimer& timer(); + __global__ void upSweep(int n, int factorPlusOne, int factor, int addTimes, int *idata); + + __global__ void downSweep(int n, int factorPlusOne, int factor, int addTimes, int *idata); + + __global__ void resizeArray(int n, int new_n, int *idata); + void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..17a55ad 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,23 +3,101 @@ #include "common.h" #include "naive.h" + namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + + /* + Notes: + + This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. + Example 39-1 uses shared memory. This is not required in this project. You can simply use global memory. + As a result of this, you will have to do ilog2ceil(n) separate kernel invocations. + + Since your individual GPU threads are not guaranteed to run simultaneously, + you can't generally operate on an array in-place on the GPU; it will cause race conditions. + Instead, create two device arrays. Swap them at each iteration: + read from A and write to B, read from B and write to A, and so on. + + Beware of errors in Example 39-1 in the chapter; + both the pseudocode and the CUDA code in the online version of Chapter 39 + are known to have a few small errors (in superscripting, missing braces, bad indentation, etc.) + + Be sure to test non-power-of-two-sized arrays. + */ + // TODO: __global__ + __global__ void computeNaiveScanHelper(int n, int factor, int *odata, const int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < n) + { + if (index >= factor) + { + odata[index] = idata[index - factor] + idata[index]; + } + else + { + odata[index] = idata[index]; + } + } + }//end computeNaiveScanHelper + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ + */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - } -} + // TODO + + //create 2 device arrays + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + int *inArray; + int *outArray; + cudaMalloc((void**)&inArray, n * sizeof(int)); + checkCUDAError("cudaMalloc inArray failed!"); + cudaMalloc((void**)&outArray, n * sizeof(int)); + checkCUDAError("cudaMalloc outArray failed!"); + cudaThreadSynchronize(); + + //Copy input data to GPU + cudaMemcpy(inArray, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + + for (int d = 1; d <= ilog2ceil(n); d++) + { + //Call kernel + int factor = 1 << (d - 1); + computeNaiveScanHelper<<>>(n, factor, outArray, inArray); + + //Make sure the GPU finishes before the next iteration of the loop + cudaThreadSynchronize(); + + //Swap arrays at each iteration + std::swap(outArray, inArray); + } + + timer().endGpuTimer(); + + //Copy output data back to CPU + //Make sure you're copying from inArray since you swap them every iteration + odata[0] = 0; + cudaMemcpy(odata + 1, inArray, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost); + + //FREE THE ARRAYS + cudaFree(inArray); + cudaFree(outArray); + checkCUDAError("cudaFree failed!"); + + + }//end scan function + }//end namespace Naive +}//end namespace StreamCompaction diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..a34d1e5 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -6,6 +6,8 @@ namespace StreamCompaction { namespace Naive { StreamCompaction::Common::PerformanceTimer& timer(); + __global__ void computeNaiveScanHelper(int n, int d, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata); } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..f3f8f07 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,61 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. + + Notes: + This should be a very short function which wraps a call to the + Thrust library function thrust::exclusive_scan(first, last, result). + + To measure timing, be sure to exclude memory operations by passing + exclusive_scan a thrust::device_vector (which is already allocated on the GPU). + You can create a thrust::device_vector by creating a + thrust::host_vector from the given pointer, then casting it. + + For thrust stream compaction, take a look at thrust::remove_if. + It's not required to analyze thrust::remove_if but you're encouraged to do so. + + Thrust Documentation: http://docs.nvidia.com/cuda/thrust/index.html + Thrust Quick Start Guide: https://github.com/thrust/thrust/wiki/Quick-Start-Guide */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); - timer().endGpuTimer(); + + //Create host vectors + thrust::host_vector host_vector_in(n); + + //Use thrust::copy to copy data between host and device + thrust::copy(idata, idata + n, host_vector_in.begin()); + + //Create device vectors to put into thrust::exclusive_scan. + //Cast the host vector to device vector + thrust::device_vector dev_vector_in = host_vector_in; + thrust::device_vector dev_vector_out(n);// = host_vector_out; + + //Can also do it this way -- but isn't any faster + //thrust::device_vector dev_vector_in(idata, idata + n); + //thrust::device_vector dev_vector_out(odata, odata + n); + + //Placing an exclusive_scan call once here outside the timer and once inside the timer block + //apparently neutralizes the difference between POT and NPOT sized arrays + + timer().startGpuTimer(); + + thrust::exclusive_scan(dev_vector_in.begin(), dev_vector_in.end(), dev_vector_out.begin()); + + timer().endGpuTimer(); + + //Copy the data back to odata + thrust::copy(dev_vector_out.begin(), dev_vector_out.end(), odata); } } }