diff --git a/README.md b/README.md
index b71c458..7e86f8e 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,40 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* David Liao
+* Tested on: Windows 7 Professional, Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70 GHz, GTX 1070 8192MB (SIG Lab)
+* Also Tested on: Windows 7 Professional, Intel(R) i7 4770 CPU @ 3.4GHz, Quadro K600 (SIG Lab)

-### (TODO: Your README)
+### Analysis

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![chart](images/chart.png)
+
+#### Overall
+Across the board, the CPU implementation generally lags behind the GPU implementations. The naive implementation, however, trails behind the CPU implementation, and the work-efficient GPU implementation beats the naive algorithm. Of all four implementations, the Thrust implementation is the fastest, and it scales much better than the rest on larger arrays.
+
+#### CPU
+The CPU algorithm is a serial running sum iterated across the array. Its performance is limited by its inherently sequential nature.
+
+#### Naive Implementation
+The naive implementation is the initial parallelized scan, which does not take the memory-access pattern of its computation into account. It suffers mostly from having to ping-pong between two memory buffers. It does, however, beat the Work-Efficient algorithm on smaller data sets. This might be because fewer kernel invocations are made, so there is less launch overhead and fewer memory transfers between the CPU and GPU.
+
+#### Work-Efficient Implementation
+This upgraded algorithm computes the scan in place, without the extra memory buffer, which speeds things up considerably. The initial implementation launched n threads for every iteration of the loop, even though at most depths the vast majority of them do no work at all. This caused the Work-Efficient implementation to actually perform worse than the naive implementation. A smarter implementation launches only as many threads as each iteration needs, a count calculated with a bit shift and passed to the kernel invocation. I could not measure the timing improvement directly, as anything past 1 << 13 using the inefficient approach would crash the machine (Quadro K600) I was working on. But after I implemented the hack for launching the correct number of threads, everything ran smoothly.
+
+#### Thrust
+Thrust is the fastest implementation of all four. From NSight, it appears to use shared memory, which is very efficient compared to our use of global memory, which is expensive. A quick search through Thrust's git repo shows that they have multiple versions of scan that work on memory partitioned per warp or block. They also have an implementation that works on global memory. Algorithm-wise, they appear to use a similar up/down sweep algorithm. They also seem to pick their block size very carefully, based on empirical testing. Here's a [link](https://github.com/thrust/thrust/blob/2ef13096187b40a35a71451d09e49b14074b0859/thrust/system/cuda/detail/scan.inl) to their repo.
+
+### Nsight Results
+#### Thrust
+Row 1: Accumulate Tiles
+Row 2: Exclusive Scan
+Row 3: Downsweep
+![thrust](images/nsight_thrust.PNG)
+#### Work Efficient Implementation
+![work efficient](images/nsight_efficient.PNG)
+
+#### Discussion
+These results were obtained on a sample input array of size 2^15.
+Here we see Thrust explicitly using dynamic shared memory. It has very efficient timings and high occupancy for Accumulate Tiles (as well as low register usage), but lower occupancy (and higher register usage) for the other two invocations.
I think this is because Thrust manages to schedule the sweeping functions so that it only needs a single kernel launch per operation, as compared to the Work-Efficient implementation, where we see a separate launch with 128, 64, 32, etc. blocks for each iteration of the loop. There's a lot of latency in between those kernel launches. Despite this, each of our kernel invocations has a perfect occupancy percentage (yay! this means that our counting hack worked) and very low register usage. However, because we launch the upsweep iterations and then the downsweep iterations sequentially, we take a hit in runtime. This, on top of its shared memory usage, is what makes Thrust very fast. If we can maintain our high occupancy and low register usage, but manage to squeeze all of the kernel launches into a single one, we will be in great shape.
+
+### Test Results
+![tests](images/testspassing.png)
diff --git a/images/chart.png b/images/chart.png
new file mode 100644
index 0000000..dcbe5f3
Binary files /dev/null and b/images/chart.png differ
diff --git a/images/nsight_efficient.PNG b/images/nsight_efficient.PNG
new file mode 100644
index 0000000..aa81871
Binary files /dev/null and b/images/nsight_efficient.PNG differ
diff --git a/images/nsight_thrust.PNG b/images/nsight_thrust.PNG
new file mode 100644
index 0000000..63446cb
Binary files /dev/null and b/images/nsight_thrust.PNG differ
diff --git a/images/testspassing.png b/images/testspassing.png
new file mode 100644
index 0000000..504277f
Binary files /dev/null and b/images/testspassing.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 675da35..1e2dd0b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -12,112 +12,151 @@
 #include
 #include
 #include "testing_helpers.hpp"
+#include <chrono>
+using namespace std::chrono;
+#define RUNS 1
+
+void runTimings() {
+    high_resolution_clock::time_point start, end;
+    for (int i = 15; i <= 15; i += 2) {
+        const int SIZE = 1 << i;
+        int *a = new int[SIZE];
+        int *b = new int[SIZE];
+
+        zeroArray(SIZE, a);
+        zeroArray(SIZE, b);
+
+        start = high_resolution_clock::now();
+        StreamCompaction::CPU::scan(SIZE, b, a);
+        end = high_resolution_clock::now();
+        duration<double> elapsed = end - start;
+        printf("CPU scan: %f ms\n", elapsed.count() * 1000.0f);
+
+        StreamCompaction::Naive::scan(SIZE, b, a);
+        StreamCompaction::Efficient::scan(SIZE, b, a);
+        StreamCompaction::Thrust::scan(SIZE, b, a);
+
+        delete[] a;
+        delete[] b;
+    }
+}
+
+void runTests() {
+    const int SIZE = 1 << 8;
+    const int NPOT = SIZE - 3;
+    int *a, *b, *c;
+    a = new int[SIZE];
+    b = new int[SIZE];
+    c = new int[SIZE];
+
+    // Scan tests
+
+    printf("\n");
+    printf("****************\n");
+    printf("** SCAN TESTS **\n");
+    printf("****************\n");
+
+    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    printArray(SIZE, a, true);
+
+    zeroArray(SIZE, b);
+    printDesc("cpu scan, power-of-two");
+    StreamCompaction::CPU::scan(SIZE, b, a);
+    //printArray(SIZE, b, true);
+    //printf("%lf\n", Timer::getCPUTiming("cpu_scan"));
+
+    zeroArray(SIZE, c);
+    printDesc("cpu scan, non-power-of-two");
+    StreamCompaction::CPU::scan(NPOT, c, a);
+    //printArray(NPOT, b, true);
+    printCmpResult(NPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("naive scan, power-of-two");
+    StreamCompaction::Naive::scan(SIZE, c, a);
+    printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("naive scan, non-power-of-two");
+    StreamCompaction::Naive::scan(NPOT, c, a);
+    printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan, power-of-two");
+    StreamCompaction::Efficient::scan(SIZE, c, a);
+    printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan, non-power-of-two");
+    StreamCompaction::Efficient::scan(NPOT, c, a);
+    printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust scan, power-of-two");
+    StreamCompaction::Thrust::scan(SIZE, c, a);
+    //printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust scan, non-power-of-two");
+    StreamCompaction::Thrust::scan(NPOT, c, a);
+    //printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+    printf("\n");
+    printf("*****************************\n");
+    printf("** STREAM COMPACTION TESTS **\n");
+    printf("*****************************\n");
+
+    // Compaction tests
+
+    genArray(SIZE - 1, a, 4);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    printArray(SIZE, a, true);
+
+    int count, expectedCount, expectedNPOT;
+
+    zeroArray(SIZE, b);
+    printDesc("cpu compact without scan, power-of-two");
+    count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+    expectedCount = count;
+    printArray(count, b, true);
+    printCmpLenResult(count, expectedCount, b, b);
+
+    zeroArray(SIZE, c);
+    printDesc("cpu compact without scan, non-power-of-two");
+    count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+    expectedNPOT = count;
+    printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("cpu compact with scan");
+    count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
+    printArray(count, c, true);
+    printCmpLenResult(count, expectedCount, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient compact, power-of-two");
+    count = StreamCompaction::Efficient::compact(SIZE, c, a);
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedCount, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient compact, non-power-of-two");
+    count = StreamCompaction::Efficient::compact(NPOT, c, a);
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
+    delete[] a;
+    delete[] b;
+    delete[] c;
+}

 int main(int argc, char* argv[]) {
-    const int SIZE = 1 << 8;
-    const int NPOT = SIZE - 3;
-    int a[SIZE], b[SIZE], c[SIZE];
-
-    // Scan tests
-
-    printf("\n");
-    printf("****************\n");
-    printf("** SCAN TESTS **\n");
-    printf("****************\n");
-
-    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    zeroArray(SIZE, b);
-    printDesc("cpu scan, power-of-two");
-    StreamCompaction::CPU::scan(SIZE, b, a);
-    printArray(SIZE, b, true);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu scan, non-power-of-two");
-    StreamCompaction::CPU::scan(NPOT, c, a);
-    printArray(NPOT, b, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("naive scan, power-of-two");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("naive scan, non-power-of-two");
-    StreamCompaction::Naive::scan(NPOT, c, a);
-    //printArray(SIZE, c, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient scan, power-of-two");
-    StreamCompaction::Efficient::scan(SIZE, c, a);
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient scan, non-power-of-two");
-    StreamCompaction::Efficient::scan(NPOT, c, a);
-    //printArray(NPOT, c, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("thrust scan, power-of-two");
-    StreamCompaction::Thrust::scan(SIZE, c, a);
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("thrust scan, non-power-of-two");
-    StreamCompaction::Thrust::scan(NPOT, c, a);
-    //printArray(NPOT, c, true);
-    printCmpResult(NPOT, b, c);
-
-    printf("\n");
-    printf("*****************************\n");
-    printf("** STREAM COMPACTION TESTS **\n");
-    printf("*****************************\n");
-
-    // Compaction tests
-
-    genArray(SIZE - 1, a, 4);  // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    int count, expectedCount, expectedNPOT;
-
-    zeroArray(SIZE, b);
-    printDesc("cpu compact without scan, power-of-two");
-    count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
-    expectedCount = count;
-    printArray(count, b, true);
-    printCmpLenResult(count, expectedCount, b, b);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu compact without scan, non-power-of-two");
-    count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
-    expectedNPOT = count;
-    printArray(count, c, true);
-    printCmpLenResult(count, expectedNPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu compact with scan");
-    count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
-    printArray(count, c, true);
-    printCmpLenResult(count, expectedCount, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient compact, power-of-two");
-    count = StreamCompaction::Efficient::compact(SIZE, c, a);
-    //printArray(count, c, true);
-    printCmpLenResult(count, expectedCount, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient compact, non-power-of-two");
-    count = StreamCompaction::Efficient::compact(NPOT, c, a);
-    //printArray(count, c, true);
-    printCmpLenResult(count, expectedNPOT, b, c);
+    runTests();
+    //runTimings();
 }
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index fe872d4..622547f 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,8 @@ namespace Common {
  * which map to 0 will be removed, and elements which map to 1 will be kept.
 */
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-    // TODO
+    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (idx < n) bools[idx] = idata[idx] != 0;
}

/**
@@ -32,7 +33,11 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
 */
__global__ void kernScatter(int n, int *odata, const int *idata,
        const int *bools, const int *indices) {
-    // TODO
+    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (idx < n && bools[idx]) {
+        odata[indices[idx]] = idata[idx];
+    }
}

}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index e600c29..56b7313 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -8,8 +8,14 @@ namespace CPU {
 * CPU scan (prefix sum).
 */
void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    if (n <= 0) {
+        return;
+    }
+    odata[0] = 0;
+    for (int i = 1; i < n; i++) {
+        odata[i] = odata[i - 1] + idata[i - 1];
+    }
}

/**
@@ -18,8 +24,13 @@ void scan(int n, int *odata, const int *idata) {
 * @returns the number of elements remaining after compaction.
 */
int compactWithoutScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int k = 0;
+    for (int i = 0; i < n; i++) {
+        if (idata[i]) {
+            odata[k++] = idata[i];
+        }
+    }
+    return k;
}

/**
@@ -28,8 +39,25 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
 * @returns the number of elements remaining after compaction.
 */
int compactWithScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int *scanResults = new int[n];
+
+    // mapping boolean function
+    for (int i = 0; i < n; i++) {
+        odata[i] = idata[i] != 0;
+    }
+
+    // scan
+    scan(n, scanResults, odata);
+
+    // compaction
+    int k = 0;
+    for (int i = 0; i < n; i++) {
+        if (idata[i]) {
+            k++;
+            odata[scanResults[i]] = idata[i];
+        }
+    }
+    delete[] scanResults;
+    return k;
}

}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index b2f739b..682d68d 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,17 +3,77 @@
 #include "common.h"
 #include "efficient.h"

+#define blockSize 128

namespace StreamCompaction {
namespace Efficient {

-// TODO: __global__
+__global__ void kernUpSweep(int n, int offset, int *buf) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= (n >> offset)) return;
+    int idx = index << offset;
+    buf[idx + (1 << offset) - 1] += buf[idx + (1 << (offset - 1)) - 1];
+}
+
+__global__ void kernDownSweep(int n, int offset, int *buf) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= (n >> offset)) return;
+    int idx = index << offset;
+    int t = buf[idx + (1 << offset) - 1];
+    buf[idx + (1 << offset) - 1] += buf[idx + (1 << (offset - 1)) - 1];
+    buf[idx + (1 << (offset - 1)) - 1] = t;
+}
+
/**
 * Performs prefix-sum (aka scan) on idata, storing the result into odata.
 */
void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    int *buf;
+    int padded = 1 << ilog2ceil(n);
+
+    cudaMalloc((void**)&buf, padded * sizeof(int));
+    cudaMemcpy(buf, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+    int fullBlocksPerGrid = 0;
+    float total = 0;
+    float milliseconds = 0;
+
+    cudaEvent_t start, end;
+    cudaEventCreate(&start);
+    cudaEventCreate(&end);
+    cudaEventRecord(start);
+    for (int i = 1; i <= ilog2(padded); i++) {
+        fullBlocksPerGrid = ((padded >> i) + blockSize - 1) / blockSize;
+        kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(padded, i, buf);
+    }
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(&milliseconds, start, end);
+    total += milliseconds;
+
+    cudaMemset(buf + padded - 1, 0, sizeof(int));
+
+    cudaEventRecord(start);
+    for (int i = ilog2(padded); i >= 1; i--) {
+        fullBlocksPerGrid = ((padded >> i) + blockSize - 1) / blockSize;
+        kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(padded, i, buf);
+    }
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(&milliseconds, start, end);
+    total += milliseconds;
+    printf("Work-Efficient scan: %f ms\n", total);
+
+    cudaMemcpy(odata, buf, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(end);
+    cudaFree(buf);
}

/**
@@ -26,8 +86,54 @@ void scan(int n, int *odata, const int *idata) {
 * @returns The number of elements remaining after compaction.
 */
int compact(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+    int *bools, *indices, *in, *out;
+
+    cudaMalloc((void**)&bools, n * sizeof(int));
+    cudaMalloc((void**)&indices, n * sizeof(int));
+    cudaMalloc((void**)&in, n * sizeof(int));
+    cudaMalloc((void**)&out, n * sizeof(int));
+
+    cudaMemcpy(in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+    float total = 0;
+    float milliseconds = 0;
+    cudaEvent_t start, end;
+    cudaEventCreate(&start);
+    cudaEventCreate(&end);
+    cudaEventRecord(start);
+
+    StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, bools, in);
+
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(&milliseconds, start, end);
+    total += milliseconds;
+
+    cudaMemcpy(odata, bools, n * sizeof(int), cudaMemcpyDeviceToHost);
+    int lastBool = odata[n - 1];  // saved before the in-place scan overwrites it
+    scan(n, odata, odata);
+    // Exclusive scan: add the last boolean back in so a kept final element is counted.
+    int lenCompacted = odata[n - 1] + lastBool;
+    cudaMemcpy(indices, odata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+    cudaEventRecord(start);
+
+    StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, out, in, bools, indices);
+
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(&milliseconds, start, end);
+    total += milliseconds;
+    printf("Work-Efficient Compact: %f ms\n", total);
+    cudaMemcpy(odata, out, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(end);
+    cudaFree(bools);
+    cudaFree(indices);
+    cudaFree(in);
+    cudaFree(out);
+
+    return lenCompacted;
}

}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 3d86b60..165bec4 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,18 +2,61 @@
 #include
 #include "common.h"
 #include "naive.h"
+#include <utility>
+
+#define blockSize 128

namespace StreamCompaction {
namespace Naive {

-// TODO: __global__
+__global__ void kernelScan(int offset, int n, int *dev_odata, int *dev_idata) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= n) return;
+    if (index < offset) {
+        dev_idata[index] = dev_odata[index];
+    }
+    else {
+        dev_idata[index] = dev_odata[index - offset] + dev_odata[index];
+    }
+}

/**
 * Performs prefix-sum (aka scan) on idata, storing the result into odata.
 */
void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+    int offset;
+    int *dev_odata, *dev_idata;
+    cudaMalloc((void**)&dev_odata, n * sizeof(int));
+    cudaMalloc((void**)&dev_idata, n * sizeof(int));
+
+    cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+    cudaEvent_t start, end;
+    cudaEventCreate(&start);
+    cudaEventCreate(&end);
+    cudaEventRecord(start);
+
+    int log2n = ilog2ceil(n);
+    for (int i = 1; i <= log2n; i++) {
+        offset = 1 << (i - 1);
+        kernelScan<<<fullBlocksPerGrid, blockSize>>>(offset, n, dev_odata, dev_idata);
+        std::swap(dev_odata, dev_idata);
+    }
+
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    float milliseconds = 0;
+    cudaEventElapsedTime(&milliseconds, start, end);
+    printf("Naive scan: %f ms\n", milliseconds);
+
+    // Shift the inclusive result right by one to produce an exclusive scan.
+    cudaMemcpy(odata + 1, dev_odata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+    odata[0] = 0;
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(end);
+    cudaFree(dev_odata);
+    cudaFree(dev_idata);
}

}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..71e6c10 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -16,6 +16,24 @@ void scan(int n, int *odata, const int *idata) {
     // TODO use `thrust::exclusive_scan`
     // example: for device_vectors dv_in and dv_out:
     // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+    thrust::device_vector<int> thrust_idata(idata, idata + n);
+    thrust::device_vector<int> thrust_odata(odata, odata + n);
+
+    cudaEvent_t start, end;
+    cudaEventCreate(&start);
+    cudaEventCreate(&end);
+    cudaEventRecord(start);
+
+    thrust::exclusive_scan(thrust_idata.begin(), thrust_idata.end(), thrust_odata.begin());
+
+    cudaEventRecord(end);
+    cudaEventSynchronize(end);
+    float milliseconds = 0;
+    cudaEventElapsedTime(&milliseconds, start, end);
+    printf("Thrust scan: %f ms\n", milliseconds);
+    thrust::copy(thrust_odata.begin(), thrust_odata.end(), odata);
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(end);
}
}