diff --git a/README.md b/README.md index b71c458..a3860b3 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,95 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Austin Eng +* Tested on: Windows 10, i7-4770K @ 3.50GHz 16GB, GTX 780 3072MB (Personal Computer) -### (TODO: Your README) +## Analysis -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +**Note: reported graphs are the average of 100 trials. Input sizes increase by powers of two; because the input size doubles at each step, both axes are displayed on a log scale.** +![Scan Analysis](https://docs.google.com/spreadsheets/d/1x1MppbyAceIIrwhDLsmV7unUYS2RYU_I20_wK0ReORY/pubchart?oid=175703576&format=image) + +![Compaction Analysis](https://docs.google.com/spreadsheets/d/1x1MppbyAceIIrwhDLsmV7unUYS2RYU_I20_wK0ReORY/pubchart?oid=477396612&format=image) + +For smaller input sizes, the CPU implementations of both Scan and Stream Compaction are much, much faster than the GPU implementations. When working over contiguous buffers of memory, the CPU reaps large benefits from its cache, which makes it very fast. However, at around 2^19 elements, the more efficient GPU implementations begin to outperform the CPU. With only a single core, the CPU falls behind as the amount of work doubles with each successive input size. + +Meanwhile, on the GPU, that work is divided across many cores, so runtime grows far more slowly. The GPU does, however, pay a larger fixed cost for memory transfers and memory access, which is why it is slower at small input sizes. That overhead grows only linearly with input size, so for larger datasets the GPU wins on performance. 
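+For reference, the CPU baseline measured in the graphs above is a single sequential pass. A stand-alone sketch (illustrative only; `exclusiveScan` is a hypothetical name, and the real implementation is `StreamCompaction::CPU::scan` later in this diff):

```cpp
#include <vector>

// Exclusive prefix sum: odata[i] = sum of idata[0..i-1].
// One linear pass over a contiguous buffer -- the cache-friendly
// loop that the CPU timings above measure.
std::vector<int> exclusiveScan(const std::vector<int>& idata) {
    std::vector<int> odata(idata.size());
    int total = 0;
    for (std::size_t i = 0; i < idata.size(); ++i) {
        odata[i] = total;   // write the running sum *before* adding idata[i]
        total += idata[i];
    }
    return odata;
}
```

The GPU implementations below compute the same result, but trade this single dependent chain of additions for many parallel passes.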
+ +In comparing the Naive and Efficient GPU implementations, we see that for smaller datasets the Naive implementation is faster. This is probably because fewer kernel invocations are made. Even though the Naive method performs many more additions (O(n log n) versus O(n)), this still takes less time at small sizes. However, as the input size increases, the more work-efficient method wins out. + +I did not see much difference between power-of-two and non-power-of-two input sizes. This is likely because my implementation simply pads non-power-of-two inputs up to the next power of two. + +### More Efficient `Efficient` Scan + +It turns out that my initial implementation of the Efficient scan was the extra credit implementation. Instead of launching the same number of threads for every upsweep and downsweep iteration, we decrease the number to avoid wasted threads and increase occupancy. Why? For the upsweep, each iteration needs half as many threads as the previous one; the others do nothing. For the downsweep, the first iteration uses just 1 thread and each subsequent iteration doubles that number. A more efficient way to implement Efficient scan is to launch only the number of threads needed and have each thread's calculated index jump by a power of two. So: `index = 2^d * (blockIdx.x * blockDim.x + threadIdx.x)`. Now our indices will jump 2 -- 4 -- 6 -- 8 or 16 -- 32 -- 48 -- 64, etc. We can launch only the needed number of threads instead of launching `n` threads, most of which do no work. + + +### Why is Thrust So Fast? + +It seems like the Thrust implementation receives a big performance boost from using shared memory. Judging from the names of the function calls (`accumulate_tiles`, `exclusive_scan`, `exclusive_downsweep`), Thrust appears to be doing the same thing as the Efficient implementation, except the `accumulate_tiles` calls have 32 and 4064 static and dynamic bytes of shared memory, respectively. `exclusive_scan`: 48 and 12240. 
`exclusive_downsweep`: 32 and 6880. This probably allows for much more efficient memory access in the kernel. Analysis also shows that each of the kernels is called twice, notably wrapped in `cuda_task` and `parallel_group`. This is probably done because the computation needs to be split into multiple pieces since shared memory can only be so large. + +## Test Output +``` +**************** +** SCAN TESTS ** +**************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 10 0 ] +==== cpu scan, power-of-two ==== +Elapsed: 10.0046ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ] +==== cpu scan, non-power-of-two ==== +Elapsed: 9.0066ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ] + passed +==== naive scan, power-of-two ==== +Elapsed: 9.708448ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ] + passed +==== naive scan, non-power-of-two ==== +Elapsed: 9.713088ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 0 0 ] + passed +==== work-efficient scan, power-of-two ==== +Elapsed: 4.019968ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ] + passed +==== work-efficient scan, non-power-of-two ==== +Elapsed: 3.999136ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ] + passed +==== thrust scan, power-of-two ==== +Elapsed: 0.906560ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ] + passed +==== thrust scan, non-power-of-two ==== +Elapsed: 1.042912ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ] + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== +Elapsed: 17.0074ms + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== +Elapsed: 17.0071ms + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 
2 2 ] + passed +==== cpu compact with scan ==== +Elapsed: 6.0037ms +Elapsed: 31.0118ms + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + passed +==== work-efficient compact, power-of-two ==== +Elapsed: 5.496416ms + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + passed +==== work-efficient compact, non-power-of-two ==== +Elapsed: 5.449856ms + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + passed +``` \ No newline at end of file diff --git a/calc_stats.js b/calc_stats.js new file mode 100644 index 0000000..0110a97 --- /dev/null +++ b/calc_stats.js @@ -0,0 +1,39 @@ + +var fs = require("fs") + +fs.readFile(process.argv[2], function (err, data) { + + var stats = {} + + var output = data.toString(); + var re = /\n==== ([\s\S]+?) ====[^=]+Elapsed: ([\.\d]+)ms/g + outputs = output.split("SIZE: "); + + for (var i = 1; i < outputs.length; ++i) { + var out = outputs[i]; + var size = parseFloat(out.match(/\d+/)[0]) + var match = re.exec(out) + while (match != null) { + + if (!(size in stats)) { + console.log('initing', size) + stats[size] = new Object() + } + if (!(match[1] in stats[size])) { + console.log('initing', size, match[1]) + stats[size][match[1]] = [0, 0] + } + stats[size][match[1]][0] += 1 + stats[size][match[1]][1] += parseFloat(match[2]) + + match = re.exec(out) + } + } + for (var i in stats) { + for (var j in stats[i]) { + stats[i][j] = stats[i][j][1] / stats[i][j][0] + } + } + console.log(stats) + +}) \ No newline at end of file diff --git a/src/main.cpp b/src/main.cpp index 675da35..e651f0a 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,111 +13,130 @@ #include #include "testing_helpers.hpp" +void test(int SIZE) { + const int NPOT = SIZE - 3; + int* a = new int[SIZE]; + int* b = new int[SIZE]; + int* c = new int[SIZE]; + + // Scan tests + + printf("\n"); + printf("****************\n"); + printf("** SCAN TESTS **\n"); + printf("****************\n"); + + genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + 
zeroArray(SIZE, b); + printDesc("cpu scan, power-of-two"); + StreamCompaction::CPU::scan(SIZE, b, a); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("cpu scan, non-power-of-two"); + StreamCompaction::CPU::scan(NPOT, c, a); + printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("naive scan, power-of-two"); + StreamCompaction::Naive::scan(SIZE, c, a); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("naive scan, non-power-of-two"); + StreamCompaction::Naive::scan(NPOT, c, a); + printArray(SIZE, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan, power-of-two"); + StreamCompaction::Efficient::scan(SIZE, c, a); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan, non-power-of-two"); + StreamCompaction::Efficient::scan(NPOT, c, a); + printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, power-of-two"); + StreamCompaction::Thrust::scan(SIZE, c, a); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, non-power-of-two"); + StreamCompaction::Thrust::scan(NPOT, c, a); + printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** STREAM COMPACTION TESTS **\n"); + printf("*****************************\n"); + + // Compaction tests + + genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + int count, expectedCount, expectedNPOT; + + zeroArray(SIZE, b); + printDesc("cpu compact without scan, power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); + expectedCount = count; + printArray(count, b, true); + printCmpLenResult(count, expectedCount, b, b); + + zeroArray(SIZE, c); + printDesc("cpu compact 
without scan, non-power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); + expectedNPOT = count; + printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + zeroArray(SIZE, c); + printDesc("cpu compact with scan"); + count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); + printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact, power-of-two"); + count = StreamCompaction::Efficient::compact(SIZE, c, a); + printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact, non-power-of-two"); + count = StreamCompaction::Efficient::compact(NPOT, c, a); + printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + delete a; + delete b; + delete c; +} + int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; - int a[SIZE], b[SIZE], c[SIZE]; - - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - 
StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count 
= StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + int SIZE = 1 << 23; + test(SIZE); + /*for (int pow = 4; pow < 24; ++pow) { + int SIZE = 1 << pow; + printf("====== SIZE: %d ======\n", SIZE); + + for (int i = 0; i < 100; ++i) { + test(SIZE); + } + }*/ + } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..cee1b43 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -13,5 +13,5 @@ set(SOURCE_FILES cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_35 ) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..8b4fbad 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,9 @@ namespace Common { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) return; + bools[index] = idata[index] != 0; } /** @@ -32,7 +34,11 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) return; + if (bools[index]) { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..91f358d 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -33,3 +33,18 @@ namespace Common { const int *idata, const int *bools, const int *indices); } } + +#define START_CUDA_TIMER() \ + cudaEvent_t start, stop; \ + cudaEventCreate(&start); \ + cudaEventCreate(&stop); \ + cudaEventRecord(start); + +#define STOP_CUDA_TIMER() \ + cudaEventRecord(stop); \ + 
cudaEventSynchronize(stop); \ + float milliseconds = 0; \ + cudaEventElapsedTime(&milliseconds, start, stop); \ + cudaEventDestroy(start); \ + cudaEventDestroy(stop); \ + printf("Elapsed: %fms\n", milliseconds); diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..f71b0f7 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,4 +1,6 @@ #include <cstdio> +#include <chrono> +#include <iostream> #include "cpu.h" namespace StreamCompaction { @@ -8,8 +10,15 @@ namespace CPU { * CPU scan (prefix sum). */ void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); + auto begin = std::chrono::high_resolution_clock::now(); + int total = 0; + for (int i = 0; i < n; ++i) { + int val = idata[i]; + odata[i] = total; + total += val; + } + auto end = std::chrono::high_resolution_clock::now(); + std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1000000.f << "ms" << std::endl; } /** @@ -18,8 +27,14 @@ void scan(int n, int *odata, const int *idata) { * @returns the number of elements remaining after compaction. */ int compactWithoutScan(int n, int *odata, const int *idata) { - // TODO - return -1; + auto begin = std::chrono::high_resolution_clock::now(); + int idx = 0; + for (int i = 0; i < n; ++i) { + if (idata[i] != 0) odata[idx++] = idata[i]; + } + auto end = std::chrono::high_resolution_clock::now(); + std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1000000.f << "ms" << std::endl; + return idx; } /** @@ -28,8 +43,19 @@ int compactWithoutScan(int n, int *odata, const int *idata) { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; + auto begin = std::chrono::high_resolution_clock::now(); + for (int i = 0; i < n; ++i) { + odata[i] = idata[i] != 0 ? 
1 : 0; + } + int last = odata[n - 1]; + // scan() prints its own Elapsed line, so this test reports two timings + scan(n, odata, odata); + int count = odata[n - 1] + last; + // in-place scatter: the write index odata[i] never exceeds i, so writes stay behind reads + for (int i = 0; i < n; ++i) { + odata[odata[i]] = idata[i]; + } + auto end = std::chrono::high_resolution_clock::now(); + std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1000000.f << "ms" << std::endl; + return count; + } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..02627cc 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,17 +3,86 @@ #include "common.h" #include "efficient.h" +__global__ void printArr(int n, const int* data) { + printf(" [ "); + for (int i = 0; i < n; ++i) { + printf("%3d ", data[i]); + } + printf("]\n"); +} + namespace StreamCompaction { namespace Efficient { -// TODO: __global__ +__global__ void scan_up(int n, int pow2d, int *odata, const int *idata) { + int index = 2 * pow2d * (blockIdx.x * blockDim.x + threadIdx.x + 1) - 1; + if (index >= n) return; + + // set last value to 0 here to avoid cudaMemcpy + if (index == n - 1) { + odata[index] = 0; + } else { + odata[index] = idata[index - pow2d] + idata[index]; + } +} + +__global__ void scan_down(int n, int pow2d, int *odata, const int *idata) { + int index = n - 2 * pow2d * (blockIdx.x * blockDim.x + threadIdx.x) - 1; + if (index < 0) return; + + int temp = idata[index - pow2d]; + odata[index - pow2d] = idata[index]; + odata[index] = temp + idata[index]; +} + +__global__ void zero(int n, int *odata) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) return; + odata[index] = 0; +} /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); + int* dev_data; + int sizePow2 = 1 << ilog2ceil(n); + + int blockSize = 128; + int nBlocks; + + cudaMalloc((void**)&dev_data, sizePow2 * sizeof(int)); + checkCUDAError("cudaMalloc dev_data failed!"); + + blockSize = 32; + // fill with 0 + nBlocks = (sizePow2 + blockSize - 1) / blockSize; + zero<<<nBlocks, blockSize>>>(sizePow2, dev_data); + + cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_data failed!"); + + blockSize = 128; + START_CUDA_TIMER() + // scan up: each level needs half the threads of the previous one + for (int pow2d = 1, threads = sizePow2 / 2; pow2d < sizePow2 / 2; pow2d *= 2, threads /= 2) { + nBlocks = (threads + blockSize - 1) / blockSize; // threads / blockSize, rounded up + scan_up<<<nBlocks, blockSize>>>(sizePow2, pow2d, dev_data, dev_data); + } + + blockSize = 128; + // scan down: starts with 1 thread and doubles each level + for (int pow2d = sizePow2 / 2, threads = 1; pow2d >= 1; pow2d /= 2, threads *= 2) { + nBlocks = (threads + blockSize - 1) / blockSize; // threads / blockSize, rounded up + scan_down<<<nBlocks, blockSize>>>(sizePow2, pow2d, dev_data, dev_data); + } + STOP_CUDA_TIMER() + + cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy from dev_data to odata failed!"); + + cudaFree(dev_data); + checkCUDAError("cudaFree dev_data failed!"); } /** @@ -26,8 +95,81 @@ void scan(int n, int *odata, const int *idata) { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { - // TODO - return -1; + int* dev_idata; + int* dev_odata; + int* dev_mask; + int sizePow2 = 1 << ilog2ceil(n); + + const int blockSize = 128; + int nBlocks; + + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + cudaMalloc((void**)&dev_mask, sizePow2 * sizeof(int)); + checkCUDAError("cudaMalloc dev_mask failed!"); + + // zero the padded tail [n, sizePow2) so scans of non-power-of-two inputs are correct + nBlocks = (sizePow2 + blockSize - 1) / blockSize; + zero<<<nBlocks, blockSize>>>(sizePow2, dev_mask); + + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_data failed!"); + + // printArr<<<1, 1>>>(n, dev_idata); + START_CUDA_TIMER() + // create mask + nBlocks = (n + blockSize - 1) / blockSize; + StreamCompaction::Common::kernMapToBoolean<<<nBlocks, blockSize>>>(n, dev_mask, dev_idata); + + // save the mask here for usage later + cudaMemcpy(dev_odata, dev_mask, sizeof(int) * n, cudaMemcpyDeviceToDevice); + checkCUDAError("cudaMemcpy from dev_mask to dev_odata failed!"); + + // printArr<<<1, 1>>>(n, dev_mask); + + int endsWith1; + cudaMemcpy(&endsWith1, &dev_mask[n - 1], sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy copy to endswith1 failed!"); + + // scan up + for (int pow2d = 1, threads = sizePow2 / 2; pow2d < sizePow2 / 2; pow2d *= 2, threads /= 2) { + nBlocks = (threads + blockSize - 1) / blockSize; // threads / blockSize, rounded up + scan_up<<<nBlocks, blockSize>>>(sizePow2, pow2d, dev_mask, dev_mask); + } + + // scan down + for (int pow2d = sizePow2 / 2, threads = 1; pow2d >= 1; pow2d /= 2, threads *= 2) { + nBlocks = (threads + blockSize - 1) / blockSize; // threads / blockSize, rounded up + scan_down<<<nBlocks, blockSize>>>(sizePow2, pow2d, dev_mask, dev_mask); + } + + int last; + // copy back last val so we know how many elements + cudaMemcpy(&last, &dev_mask[sizePow2 - 1], sizeof(int), cudaMemcpyDeviceToHost); + 
checkCUDAError("cudaMemcpy copy last failed!"); + last += endsWith1; // increment if we should include the very last element + + // printArr<<<1, 1>>>(n, dev_mask); + + // scatter + nBlocks = (n + blockSize - 1) / blockSize; + StreamCompaction::Common::kernScatter<<<nBlocks, blockSize>>>(n, dev_odata, dev_idata, dev_odata, dev_mask); + + STOP_CUDA_TIMER() + + cudaMemcpy(odata, dev_odata, sizeof(int) * last, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy from dev_odata to odata failed!"); + + cudaFree(dev_idata); + checkCUDAError("cudaFree dev_idata failed!"); + + cudaFree(dev_odata); + checkCUDAError("cudaFree dev_odata failed!"); + + cudaFree(dev_mask); + checkCUDAError("cudaFree dev_mask failed!"); + + return last; } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..2b4f2a9 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,19 +1,81 @@ #include <cuda.h> #include <cuda_runtime.h> +#include <utility> #include "common.h" #include "naive.h" namespace StreamCompaction { namespace Naive { -// TODO: __global__ +__global__ void printArr(int n, const int* data) { + printf(" [ "); + for (int i = 0; i < n; ++i) { + printf("%3d ", data[i]); + } + printf("]\n"); +} + +/* + * Performs one iteration of a naive scan of N elements. pow2d = 2^depth + */ +__global__ void naiveScanIteration(int N, int pow2d, int* odata, const int* idata) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= N) return; + if (index >= pow2d) { + odata[index] = idata[index] + idata[index - pow2d]; + } + else { + // we've already processed these elements. just copy them + odata[index] = idata[index]; + } +} + +__global__ void rshift(int n, int* odata, const int* idata) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) return; + odata[index] = index == 0 ? 0 : idata[index - 1]; +} /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); + int* dev_idata; + int* dev_odata; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_data failed!"); + + const int blockSize = 128; + const int nBlocks = (n + blockSize - 1) / blockSize; // n/blockSize, rounded up + + START_CUDA_TIMER() + + for (int pow2d = 1; pow2d < n; pow2d *= 2) { + naiveScanIteration<<<nBlocks, blockSize>>>(n, pow2d, dev_odata, dev_idata); + std::swap(dev_idata, dev_odata); + } + + // the swaps left the inclusive result in dev_idata; rshift converts it to an exclusive scan in dev_odata + rshift<<<nBlocks, blockSize>>>(n, dev_odata, dev_idata); + + STOP_CUDA_TIMER() + + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy from dev_data to odata failed!"); + + cudaFree(dev_idata); + checkCUDAError("cudaFree dev_idata failed!"); + + cudaFree(dev_odata); + checkCUDAError("cudaFree dev_odata failed!"); } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..50d3c8c 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -13,9 +13,12 @@ namespace Thrust { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::device_vector<int> dv_in(idata, idata + n); + thrust::device_vector<int> dv_out(odata, odata + n); + START_CUDA_TIMER() + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + STOP_CUDA_TIMER() + thrust::copy(dv_out.begin(), dv_out.end(), odata); } }