diff --git a/INSTRUCTION.md b/INSTRUCTION.md index 792c530..d8e0397 100644 --- a/INSTRUCTION.md +++ b/INSTRUCTION.md @@ -310,4 +310,4 @@ The template of the comment section of your pull request is attached below, you + You can try multi-threading on CPU if you want (not required and not our focus) + for each element input[i] in original array - if it's non-zero (given by mapped array) - - then put it at output[index], where index = scanned[i] \ No newline at end of file + - then put it at output[index], where index = scanned[i] diff --git a/README.md b/README.md index b71c458..fe3fc01 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,37 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Anton Khabbaz +* pennkey: akhabbaz +* Tested on: Windows 10 Surface Book, i7-6600U @ 2.66 GHz, GTX 965M (personal computer) ### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project aimed to understand parallel algorithms and how they are implemented on the GPU. A scan computes a running sum over an array of integers; here we implemented four versions of scan: a serial CPU version, a naive GPU version, a work-efficient GPU version, +and one using the thrust library. + +For the work-efficient scan I implemented the extra credit: all threads are contiguous in both the up sweep and the down sweep. Furthermore, I trimmed the number of threads so that the grid size was just large enough to run the threads needed. In the end, with some 10^8 data elements, my work-efficient scan matched the CPU scan in speed. The GPU and CPU timers measured computation only, not the time to copy and transfer data. + +The tests passed for every array size tried.
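The compaction recipe from the instructions above (map to 0/1, exclusive scan, then scatter with index = scanned[i]) can be sketched on the CPU. This is a minimal illustration under the same definitions, not the project's code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive scan: out[i] holds the sum of in[0..i-1]; out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Compact: keep the non-zero elements, writing input[i] to out[scanned[i]].
std::vector<int> compact(const std::vector<int>& in) {
    std::vector<int> bools(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) bools[i] = (in[i] != 0);
    std::vector<int> scanned = exclusiveScan(bools);
    // Count = last scan value plus last bool (handles a trailing non-zero).
    int count = in.empty() ? 0 : scanned.back() + bools.back();
    std::vector<int> out(count);
    for (std::size_t i = 0; i < in.size(); ++i)
        if (bools[i]) out[scanned[i]] = in[i];
    return out;
}
```

Running `compact({1, 0, 2, 0, 3})` keeps the three non-zero values in order; the GPU version reproduces exactly these steps with one kernel per stage.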
+ +![](img/RunTimes.png) +This figure plots the log10 of the run time per element in the array; this measure stays relatively constant as N increases. The work-efficient scan starts out worse than the CPU scan, but as the number of elements grows it first beats the naive scan and then matches the CPU scan. Thrust, on the other hand, is the fastest scan for large data (faster by a factor of about 6). +The non-power-of-two scans were slightly faster than, but comparable in speed to, the power-of-two scans. + +Stream compaction is one application of scan: it removes elements from a stream. At around 10^5 elements, the work-efficient compaction actually beat the other compactions. A thrust-based scan would beat the work-efficient compaction, since a majority of the compaction time is spent computing the exclusive scan. + +![](img/StreamCompactionTimes.png) + + +I implemented the efficient scan using contiguous threads. This worked perfectly up to one block, but beyond one block the code failed: threads in different blocks do not synchronize. + +I had many difficulties. At first I used __syncthreads() to synchronize, which worked only when a single block held all the threads. I then changed the code to iterate over strides on the CPU, which synchronizes all the blocks between kernel launches. + +Another major issue was that at around 2^15 elements or so I got an allocation error. I traced it to the work-efficient scan and used the CUDA memory checker to catch the out-of-bounds access. That showed I was multiplying two integers each around 65K to compute the index (some threads had high indices), which overflowed to a negative index. I got around that by returning early based on the stride, so that the computed index can never exceed the available array. Ultimately I also chose the grid size so that only the minimum number of threads would be launched, a number that varies with the stride. This culled the number of possible threads and sped up the run time.
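The overflow described above is easy to see in isolation: with 32-bit ints, an index of the form (thread + 1) * stride - 1 wraps once thread and stride are both near 65K, because their product exceeds 2^31 - 1. A hypothetical sketch of the widened computation (the fix in this project limits the launched threads instead; `safeIndex` is an illustrative name, not project code):

```cpp
#include <cassert>
#include <cstdint>

// Index computation as in an up-sweep kernel: index = (thread + 1) * stride - 1.
// With plain 32-bit ints, thread and stride both near 65,536 overflow the
// multiplication (65,536 * 65,536 = 2^32), yielding a negative index.
// Widening to 64 bits before multiplying lets the caller compare the result
// against the array length without wrap-around.
int64_t safeIndex(int thread, int stride) {
    return (static_cast<int64_t>(thread) + 1) * stride - 1;
}
```

With the widened type, `safeIndex(65535, 65536)` is 4294967295 rather than a wrapped negative value, so a bounds check like `index < N` behaves correctly.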
+ + +A final issue was that in the compaction kernel I did not return when the index was beyond the array size. That caused multiple writes to the same location, which was hard to debug. + +The code now works for every N I tried, up to the maximum memory on my GPU, about 2 GB. diff --git a/RunTimes.xlsx b/RunTimes.xlsx new file mode 100644 index 0000000..342477e Binary files /dev/null and b/RunTimes.xlsx differ diff --git a/img/RunTimes.pdf b/img/RunTimes.pdf new file mode 100644 index 0000000..4666c23 Binary files /dev/null and b/img/RunTimes.pdf differ diff --git a/img/RunTimes.png b/img/RunTimes.png new file mode 100644 index 0000000..3948a97 Binary files /dev/null and b/img/RunTimes.png differ diff --git a/img/StreamCompactionTimes.pdf b/img/StreamCompactionTimes.pdf new file mode 100644 index 0000000..6953cd9 Binary files /dev/null and b/img/StreamCompactionTimes.pdf differ diff --git a/img/StreamCompactionTimes.png b/img/StreamCompactionTimes.png new file mode 100644 index 0000000..cbeef67 Binary files /dev/null and b/img/StreamCompactionTimes.png differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..4feb0d6 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -12,19 +12,23 @@ #include #include #include "testing_helpers.hpp" +#include -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 26; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two -int a[SIZE], b[SIZE], c[SIZE]; +//int a[SIZE], b[SIZE], c[SIZE]; int main(int argc, char* argv[]) { - // Scan tests + // Scan tests + int* a = static_cast<int*>(malloc(SIZE * sizeof(int))); + int* b = static_cast<int*>(malloc(SIZE * sizeof(int))); + int* c = static_cast<int*>(malloc(SIZE * sizeof(int))); printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); - + printf("Size: %d\n", SIZE); genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -45,12 +49,14
@@ int main(int argc, char* argv[]) { printArray(NPOT, b, true); printCmpResult(NPOT, b, c); - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); + + + printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(cuda measured)"); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); + zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); @@ -63,14 +69,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); @@ -129,15 +135,18 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + 
printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit + free(a); + free(b); + free(c); } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index ae94ca6..f0257a1 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -44,7 +44,7 @@ void zeroArray(int n, int *a) { } void genArray(int n, int *a, int maxval) { - srand(time(nullptr)); + // srand(time(nullptr)); for (int i = 0; i < n; i++) { a[i] = rand() % maxval; diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..6ad13f6 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -13,5 +13,5 @@ set(SOURCE_FILES cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_52 ) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..f3d7519 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,11 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { + return; + } + bools[index] = ( idata[index] == 0) ? 
0 : 1; } /** @@ -32,7 +36,13 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { + return; + } + if ( bools[index] == 1) { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 55f1b38..fe5276e 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,5 +1,5 @@ -#pragma once - +#pragma once + #include #include @@ -7,36 +7,39 @@ #include #include #include -#include +#include #include - -#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) -#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) - -/** - * Check for CUDA errors; print and exit if there was a problem. - */ -void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); - -inline int ilog2(int x) { - int lg = 0; - while (x >>= 1) { - ++lg; - } - return lg; -} - -inline int ilog2ceil(int x) { - return ilog2(x - 1) + 1; -} - -namespace StreamCompaction { - namespace Common { - __global__ void kernMapToBoolean(int n, int *bools, const int *idata); - - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices); - + +#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) +#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) + +/** + * Check for CUDA errors; print and exit if there was a problem. 
+ */ +void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); + +inline int ilog2(int x) { + if (x < 0) { + return 0; + } + int lg = 0; + while (x >>= 1) { + ++lg; + } + return lg; +} + +inline int ilog2ceil(int x) { + return ilog2(x - 1) + 1; +} + +namespace StreamCompaction { + namespace Common { + __global__ void kernMapToBoolean(int n, int *bools, const int *idata); + + __global__ void kernScatter(int n, int *odata, + const int *idata, const int *bools, const int *indices); + /** * This class is used for timing the performance * Uncopyable and unmovable @@ -127,6 +130,6 @@ namespace StreamCompaction { float prev_elapsed_time_cpu_milliseconds = 0.f; float prev_elapsed_time_gpu_milliseconds = 0.f; - }; - } -} + }; + } +} diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..981593c 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,50 +1,93 @@ -#include -#include "cpu.h" - +#include +#include "cpu.h" + #include "common.h" - -namespace StreamCompaction { - namespace CPU { + +namespace StreamCompaction { + namespace CPU { using StreamCompaction::Common::PerformanceTimer; PerformanceTimer& timer() { static PerformanceTimer timer; return timer; - } - - /** - * CPU scan (prefix sum). - * For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. - */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } - - /** - * CPU stream compaction without using the scan function. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - - /** - * CPU stream compaction using scan and scatter, like the parallel version. 
- * - * @returns the number of elements remaining after compaction. - */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - } -} + } + + + // occupied array converts an array of integers to an array of 0, 1s + + void boolArray(int n, int * odata, const int * idata){ + + for (int i{0}; i < n; ++i){ + odata[i] = (idata[i] == 0) ? 0 : 1; + } + } + /** + * CPU scan (prefix sum). + * For performance analysis, this is supposed to be a simple for loop. + * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + */ + // implement the simplest for loop to do the sum of prior elements + void scanNoTimer(int n, int *odata, const int *idata) { + // odata[0] = 0; + // for (int i{ 1 }; i < n; ++i) { + // odata[i] = odata[i - 1] + idata[i - 1]; + // } + int prior {0}; + for (int i{0}; i < n; ++i){ + *(odata+i) = prior; + prior = prior + idata[i]; + } + } + void scan(int n, int *odata, const int *idata) { + timer().startCpuTimer(); + scanNoTimer(n, odata, idata); + timer().endCpuTimer(); + } + + /** + * CPU stream compaction without using the scan function. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithoutScan(int n, int *odata, const int *idata) { + timer().startCpuTimer(); + int size {0}; + for (int i {0}; i < n; ++i) + { + if ( idata[i] != 0) { + odata[size++] = idata[i]; + } + } + + timer().endCpuTimer(); + return size; + } + + /** + * CPU stream compaction using scan and scatter, like the parallel version. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithScan(int n, int *odata, const int *idata) { + + // timer().startCpuTimer(); + // allocation is costly and may dominate the time. 
+ int * boolarray { new int[n]}; + timer().startCpuTimer(); + // turn idata into an array of 0 and 1s + boolArray(n, boolarray, idata); + // odata is now the exclusive prefix sum. + scanNoTimer(n, odata, boolarray); + int lastIndex{-1}; + for (int i {0}; i < n; ++i) + { + if ( boolarray[i] ) { + lastIndex = odata[i]; + odata[lastIndex] = idata[i]; + } + } + timer().endCpuTimer(); + return lastIndex + 1; + } + } +} diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..4b6c770 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,39 +2,220 @@ #include #include "common.h" #include "efficient.h" +#include +static const int blockSize{ 256 }; + +void printArray2(int n, int *a, bool abridged = false) { + printf(" [ "); + for (int i = 0; i < n; i++) { + if (abridged && i + 2 == 15 && n > 16) { + i = n - 2; + printf("... "); + } + printf("%3d ", a[i]); + } + printf("]\n"); +} + + namespace StreamCompaction { - namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - - /** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. 
- */ - int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; - } - } + namespace Efficient { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + // initScan allocates an N-element buffer on the GPU and returns its + // address through dev_idata. N is the allocated size (the power of 2 + // greater than or equal to n); n is the size of the data array. + // The buffer is zeroed, then the first n host elements are copied + // into dev_idata, so the padding out to N stays 0. + void initScan(int N, int n, const int *idata, int ** dev_idata) + { + int size{ sizeof(int) }; + cudaMalloc(reinterpret_cast<void**>(dev_idata), N * size); + checkCUDAError("Allocating Scan Buffer Efficient Error"); + cudaMemset(*dev_idata, 0, N * size); + cudaMemcpy(*dev_idata, idata, n * size, cudaMemcpyHostToDevice); + // no need to initialize the odata because the loop does that each time + checkCUDAError("Initialize and Copy data to target Error"); + cudaThreadSynchronize(); + } + // transfer scan data back to host + void transferScan(int N, int * odata, int * dev_odata) + { + cudaMemcpy(odata, dev_odata, N * sizeof(int), cudaMemcpyDeviceToHost); + } + + // end the scan on the device. 
+ void endScan(int * dev_idata) + { + cudaFree(dev_idata); + } + // kernParallelReduction uses contiguous threads to do the parallel reduction + // There is one thread for every two elements + __global__ void kernParallelReduction(int N, int Stride, + int maxThreads, int * dev_idata) + { + int thread = threadIdx.x + blockIdx.x * blockDim.x; + if (thread >= maxThreads) { + return; + } + int priorStride{ Stride >> 1 }; + int index = (thread + 1) * Stride - 1; + if (index < N) { + dev_idata[index] += dev_idata[index - priorStride]; + } + } + // Downsweep uses contiguous threads to sweep down and add the intermediate + // results to the partial sums already computed + // There is one thread for every two elements. Here there is a for loop + // that changes the stride. Contiguous allows the first threads to do all + // the work and later warps will all be 0. + __global__ void kernDownSweep(int N, int stride, int maxThreads, int * dev_idata) + { + int thread = threadIdx.x + blockIdx.x * blockDim.x; + if (thread >= maxThreads) { + return; + } + // have one thread set the last element to 0; + int startOffset{ N - 1 }; + int right = -stride * thread + startOffset; + if (right >= 0) { + int separation = stride >> 1; + int left = right - separation; + int current = dev_idata[right]; + dev_idata[right] += dev_idata[left]; + dev_idata[left] = current; + } + } + inline int gridSize(int threads) { + return (threads + blockSize - 1) / blockSize; + } + /* Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void efficientScan(int N, int d, int * dev_idata) + { + int maxThreads{ N >> 1 }; + for (int stride{ 2 }; stride <= N; stride *= 2) + { + int grids{ gridSize(maxThreads) }; + dim3 fullBlocksPerGrid(grids); + kernParallelReduction<<<fullBlocksPerGrid, blockSize>>> + (N, stride, maxThreads, dev_idata); + maxThreads >>= 1; + } + cudaMemset((dev_idata + N - 1), 0, sizeof(int)); + maxThreads = 1; + for (int stride{ N }; stride > 1; stride >>= 1) + { + int grids{ gridSize(maxThreads) }; + dim3 fullBlocksPerGrid(grids); + kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(N, stride, + maxThreads, dev_idata); + maxThreads <<= 1; + } + } + void scan(int n, int *odata, const int *idata) + { + int * dev_idata; + // d is the number of sweep passes needed and also the + // upper bound for log2 of the number of elements + int d{ ilog2ceil(n) }; + int N{ 1 << d }; + initScan(N, n, idata, &dev_idata); + timer().startGpuTimer(); + efficientScan(N, d, dev_idata); + timer().endGpuTimer(); + // only transfer the first n elements of the + // exclusive scan + transferScan(n, odata, dev_idata); + endScan(dev_idata); + } + + void initCompact(int N, int n, const int *idata, int ** dev_idata, + int **dev_booldata, int ** dev_indices, int **dev_odata) + { + int size{ sizeof(int) }; + cudaMalloc(reinterpret_cast<void**>(dev_booldata), N * size); + cudaMalloc(reinterpret_cast<void**>(dev_idata), N * size); + cudaMalloc(reinterpret_cast<void**>(dev_indices), N * size); + cudaMalloc(reinterpret_cast<void**>(dev_odata), N * size); + checkCUDAError("Allocating Compaction Scan Error"); + cudaMemset(*dev_idata, 0, N * size); + cudaMemcpy(*dev_idata, idata, n * size, cudaMemcpyHostToDevice); + // no need to initialize the odata because the scatter writes it + checkCUDAError("Initialize and Copy data to target Error"); + cudaThreadSynchronize(); + } + /** + * Performs stream compaction on idata, storing the result into odata. + * All zeroes are discarded. 
+ * + * @param n The number of elements in idata. + * @param odata The array into which to store elements. + * @param idata The array of elements to compact. + * @returns The number of elements remaining after compaction. + */ + int compact(int n, int *odata, const int *idata) { + + int * dev_idata; + int * dev_booldata; + int * dev_indices; + int * dev_odata; + // d is the number of scans needed and also the + // upper bound for log2 of the number of elements + int d{ ilog2ceil(n) }; + int N{ 1 << d }; + initCompact(N, n, idata, &dev_idata, &dev_booldata, &dev_indices, + &dev_odata); + timer().startGpuTimer(); + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + + StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(N, dev_booldata, dev_idata); + cudaMemcpy(dev_indices, dev_booldata, N * sizeof(int), + cudaMemcpyDeviceToDevice); + efficientScan(N, d, dev_indices); + StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(N, dev_odata, dev_idata, + dev_booldata, dev_indices); + timer().endGpuTimer(); + int lastIndex; + transferScan(1, &lastIndex, dev_indices + N - 1); + int lastIncluded; + transferScan(1, &lastIncluded, dev_booldata + N - 1); + std::vector<int> input(n); + std::vector<int> bools(n); + std::vector<int> indices(n); + transferScan(n, input.data(), dev_idata); + transferScan(n, bools.data(), dev_booldata); + transferScan(n, indices.data(), dev_indices); + printArray2(n, input.data(), true); + printArray2(n, bools.data(), true); + printArray2(n, indices.data(), true); + n = lastIncluded + lastIndex; + transferScan(n, odata, dev_odata); + printArray2(n, odata, true); + endScan(dev_odata); + endScan(dev_idata); + endScan(dev_indices); + endScan(dev_booldata); + return n; + } + } } diff --git a/stream_compaction/efficientOld.cu b/stream_compaction/efficientOld.cu new file mode 100644 index 0000000..6086197 --- /dev/null +++ b/stream_compaction/efficientOld.cu @@ -0,0 +1,193 @@ +#include +#include +#include "common.h" +#include "efficient.h" + +static const int blockSize{ 
256}; + + + + + +namespace StreamCompaction { + namespace Efficient { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + // initialize the arrays on the GPU + // returns the addresses of the pointers on the GPU + // dev_idata is a pointer to the address of the dev_idata array that + // gets updated here + // initialize dev_idata has the first + // elements copied and the remainder to make the stream 2^n + // are set to 0. The first input is the size of the arrays + // to allocate and the second input is the size of the array to transfer. + // N the maximum size of the allocated array. n is the size of the data array + // N is one more than the multiple of 2 greater or equal to n, + // in dev_idata, and then the elements are copied inte dev_idata. + + void initScan(int N, int n, const int *idata, int ** dev_idata) + { + int size {sizeof(int)}; + cudaMalloc((void**) (dev_idata), N * size); + checkCUDAError("Allocating Scan Buffer Efficient Error"); + cudaMemset(*dev_idata, 0, N * size); + cudaMemcpy(*dev_idata , idata, n *size, cudaMemcpyHostToDevice); + // no need to initialize the odata because the loop does that each time + checkCUDAError("Initialize and Copy data to target Error"); + cudaThreadSynchronize(); + } + // transfer scan data back to host + void transferScan(int N, int * odata, int * dev_odata) + { + cudaMemcpy(odata, dev_odata, N * sizeof(int), cudaMemcpyDeviceToHost); + } + + // end the scan on the device. 
+ void endScan( int * dev_idata) + { + cudaFree(dev_idata); + } + // kernParallelReduction uses contiguous threads to do the parallel reduction + // There is one thread for every two elements + __global__ void kernParallelReduction(int N, int Stride, int * dev_idata) + { + int thread = threadIdx.x + blockIdx.x * blockDim.x; + int priorStride { Stride >> 1}; + int index = (thread + 1) * Stride -1; + if (index < N) { + dev_idata[index] += dev_idata[index - priorStride]; + } + } + // Downsweep uses contiguous threads to sweep down and add the intermediate + // results to the partial sums already computed + // There is one thread for every two elements. Here there is a for loop + // that changes the stride. Contiguous allows the first threads to do all + // the work and later warps will all be 0. + __global__ void kernDownSweep(int N, int stride, int * dev_idata) + { + int thread = threadIdx.x + blockIdx.x * blockDim.x; +// if ( 2 * thread >= N) { +// return; +// } + // have one thread set the last element to 0; + int startOffset { N - 1}; + int right = - stride * thread + startOffset; + if (right >= 0) { + int separation = stride >> 1; + int left = right - separation; + int current = dev_idata[right]; + dev_idata[right] += dev_idata[left]; + dev_idata[left] = current; + } + } +// __global__ void kernDownSweep(int N, int * dev_idata) +// { +// int thread = threadIdx.x + blockIdx.x * blockDim.x; +// if ( 2 * thread >= N) { +// return; +// } +// // have one thread set the last element to 0; +// int startOffset { N - 1}; +// if (thread == 0) { +// dev_idata[startOffset] = 0; +// } +// for ( int stride = {N}; stride > 1; stride >>= 1) +// { +// int right = - stride * thread + startOffset; +// if (right >= 0) { +// int separation = stride >> 1; +// int left = right - separation; +// int current = dev_idata[right]; +// dev_idata[right] += dev_idata[left]; +// dev_idata[left] = current; +// } +// __syncthreads(); +// } +// } + /** + * Performs prefix-sum (aka scan) on idata, 
storing the result into odata. + */ + void efficientScan(int N, int * dev_idata) + { + int Nthreads {N >> 1}; + dim3 fullBlocksPerGrid((Nthreads + blockSize - 1) / blockSize); + for (int stride{2}; stride <= N; stride *= 2) + { + kernParallelReduction<<<fullBlocksPerGrid, blockSize>>> + (N, stride, dev_idata); + } + cudaMemset((dev_idata + N - 1), 0, sizeof(int)); + for (int stride{N}; stride > 1; stride >>= 1) + { + kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(N, stride, dev_idata); + } + } + void scan(int n, int *odata, const int *idata) + { + int * dev_idata; + // d is the number of scans needed and also the + // upper bound for log2 of the number of elements + int d {ilog2ceil(n)}; + // add one so that the 0th element is 0 for the + // exclusive scan + int N { 1 << d }; + initScan(N, n, idata, &dev_idata); + timer().startGpuTimer(); + efficientScan( N, dev_idata); + timer().endGpuTimer(); + // only transfer the first n elements of the + // exclusive scan + transferScan(n, odata, dev_idata); + endScan(dev_idata); + } + + void initCompact(int N, int n, const int *idata, int ** dev_idata, int **dev_booldata) + { + int size {sizeof(int)}; + cudaMalloc(reinterpret_cast<void**>(dev_booldata), N * size); + cudaMalloc(reinterpret_cast<void**>(dev_idata), N * size); + checkCUDAError("Allocating Scan Buffer Scan Error"); + cudaMemset(*dev_idata, 0, N * size); + cudaMemcpy(*dev_idata, idata, n * size, cudaMemcpyHostToDevice); + // no need to initialize the odata because the loop does that each time + checkCUDAError("Initialize and Copy data to target Error"); + cudaThreadSynchronize(); + } + /** + * Performs stream compaction on idata, storing the result into odata. + * All zeroes are discarded. + * + * @param n The number of elements in idata. + * @param odata The array into which to store elements. + * @param idata The array of elements to compact. + * @returns The number of elements remaining after compaction. 
+ */ + int compact(int n, int *odata, const int *idata) { + + int * dev_idata; + int * dev_booldata; + // d is the number of scans needed and also the + // upper bound for log2 of the number of elements + int d {ilog2ceil(n)}; + // add one so that the 0th element is 0 for the + // exclusive scan + int N { 1 << d }; + initCompact(N, n, idata, &dev_idata, &dev_booldata); + timer().startGpuTimer(); + efficientScan( N, dev_idata); + dim3 fullBlocksPerGrid((N + blockSize - 1)/ blockSize); + StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(N, dev_booldata, dev_idata); + + timer().endGpuTimer(); + timer().startGpuTimer(); + + timer().endGpuTimer(); + return -1; + } + } +} diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..b973c86 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,25 +1,103 @@ -#include -#include -#include "common.h" -#include "naive.h" - -namespace StreamCompaction { - namespace Naive { +#include +#include +#include "common.h" +#include "naive.h" +/// threads per block +static const int blockSize{ 256 }; +namespace StreamCompaction { + namespace Naive { using StreamCompaction::Common::PerformanceTimer; PerformanceTimer& timer() { static PerformanceTimer timer; return timer; - } - // TODO: __global__ - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - } -} + } + + // initialize the arrays on the GPU + // returns the addresses of the pointers on the GPU + // dev_idata is a pointer to the address of the dev_idata array that + // gets updated here + // initialize the dev_odata to 0; dev_idata has the first + // elements copied and the remainder to make the stream 2^n + // are set to 0. The first input is the size of the arrays + // to allocate and the second input is the size of the array to transfer. + // N the maximum size of the allocated array. 
n is the size of the data array. + // N is one more than the power of 2 greater than or equal to n; + // 0 is placed at the first element + // of dev_idata, and then the elements are copied into dev_idata. + // dev_odata has its first element initialized to 0 too. + void initScan(int N, int n, const int *idata, int ** dev_odata, int ** dev_idata) + { + int size {sizeof(int)}; + cudaMalloc(reinterpret_cast<void**>(dev_idata), N * size); + cudaMalloc(reinterpret_cast<void**>(dev_odata), N * size); + checkCUDAError("Allocating Scan Buffer Error"); + cudaMemset(*dev_idata, 0, N * size); + cudaMemset(*dev_odata, 0, N * size); + cudaMemcpy(*dev_idata + 1, idata, n * size, cudaMemcpyHostToDevice); + // no need to initialize the odata because the loop does that each time + checkCUDAError("Initialize and Copy data to target Error"); + cudaThreadSynchronize(); + } + // transfer scan data back to host + void transferScan(int N, int * odata, int * dev_odata) + { + cudaMemcpy(odata, dev_odata, N * sizeof(int), cudaMemcpyDeviceToHost); + } + + // end the scan on the device. + void endScan(int * dev_odata, int * dev_idata) + { + cudaFree(dev_idata); + cudaFree(dev_odata); + } + + + + __global__ void kernOneNaiveScan(int N, int pow2d_1, int * dev_odata, int * dev_idata) + { + int k = threadIdx.x + blockIdx.x * blockDim.x; + if (k >= N || (k < pow2d_1 >> 1)) { + return; + } + dev_odata[k] = dev_idata[k]; + if ( k >= pow2d_1) { + dev_odata[k] += dev_idata[k - pow2d_1]; + } + } + + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int *odata, const int *idata, int indx) + { + int * dev_idata; + int * dev_odata; + // d is the number of scans needed and also the + // upper bound for log2 of the number of elements + int d {ilog2ceil(n)}; + // add one so that the 0th element is 0 for the + // exclusive scan + int N { 1 << d }; + initScan(N + 1, n, idata, &dev_odata, &dev_idata); + timer().startGpuTimer(); + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + for (int pow2d_1 {1}; pow2d_1 < N; pow2d_1 *= 2) { + // copy all elements to dev_odata to save + kernOneNaiveScan<<<fullBlocksPerGrid, blockSize>>> + (N, pow2d_1, dev_odata + 1, dev_idata + 1); + std::swap(dev_odata, dev_idata); + if (pow2d_1 == indx) break; + } + timer().endGpuTimer(); + // only transfer the first n elements of the + // exclusive scan + transferScan(n, odata, dev_idata); + endScan(dev_odata, dev_idata); + + } + } +} diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..4d9bfd7 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -6,6 +6,6 @@ namespace StreamCompaction { namespace Naive { StreamCompaction::Common::PerformanceTimer& timer(); - void scan(int n, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata, int indx = 100); } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..d738956 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -1,28 +1,35 @@ -#include -#include -#include -#include -#include -#include "common.h" -#include "thrust.h" - -namespace StreamCompaction { - namespace Thrust { +#include +#include +#include +#include +#include +#include "common.h" +#include "thrust.h" + +namespace StreamCompaction { + namespace Thrust { using StreamCompaction::Common::PerformanceTimer; PerformanceTimer& timer() { static PerformanceTimer timer; return timer; - } - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
- */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); - timer().endGpuTimer(); - } - } -} + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + // This version uses thrust host and device vectors and their iterators. + void scan(int n, int *odata, const int *idata) { + thrust::host_vector<int> myvec(n); + thrust::copy(idata, idata + n, myvec.begin()); + thrust::device_vector<int> dev_idata = myvec; + timer().startGpuTimer(); + thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), + dev_idata.begin()); + timer().endGpuTimer(); + myvec = dev_idata; + thrust::copy(myvec.begin(), myvec.end(), odata); + } + } +}
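The naive doubling scan in naive.cu above can be simulated on the host to see what each pass does. This sketch (`naiveScanSim` is a hypothetical helper, not project code) mirrors the leading-zero trick used in that file's initScan to turn the inclusive result into an exclusive one:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host simulation of the naive scan's doubling passes: on the pass with
// offset p, out[k] = in[k] + in[k - p] for k >= p, else out[k] = in[k].
// Starting from the shifted array [0, a0, a1, ...], the doubling passes
// produce its inclusive scan, whose first N elements are the exclusive
// scan of a.
std::vector<int> naiveScanSim(const std::vector<int>& a) {
    std::vector<int> in(a.size() + 1, 0);
    std::copy(a.begin(), a.end(), in.begin() + 1);  // leading zero
    std::vector<int> out(in.size());
    for (std::size_t p = 1; p < in.size(); p *= 2) {
        for (std::size_t k = 0; k < in.size(); ++k)
            out[k] = (k >= p) ? in[k] + in[k - p] : in[k];
        std::swap(in, out);  // like ping-ponging dev_odata / dev_idata
    }
    in.pop_back();  // keep only the first N elements, as in the kernel
    return in;
}
```

Each pass reads from one buffer and writes to the other, which is why the GPU version swaps dev_odata and dev_idata between launches instead of updating in place.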