diff --git a/README.md b/README.md
index b71c458..ea0e0d7 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,102 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Carolina Zheng
+* Tested on: Windows 7, i7-6700 @ 3.40GHz 16GB, Quadro K620 (Moore 100 Lab)
-### (TODO: Your README)
+### Performance Analysis
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+I determined the runtimes by running each implementation on arrays seeded with fixed values and averaging the times over 10 runs.
+
+#### Block Size vs. Performance
+
+![](img/block-size.png)
+
+
+The optimal block size was 64 threads for the naive implementation and 32 threads for the work-efficient implementation. I was surprised that such small block sizes were optimal: I would have expected smaller block sizes to decrease performance, because they max out the block slots on each SM before maxing out the thread slots. (On a Maxwell-class card like the Quadro K620, each SM can hold at most 32 resident blocks and 2048 resident threads, so 32-thread blocks can keep at most 1024 threads resident per SM.) However, the runtime difference between block sizes was very small (<1 ms), which makes me think that block size didn't have much overall impact.
+
+In terms of occupancy, the limiting factor was likely thread slots, since my kernels use no shared memory and few registers.
+
+#### Scan Implementation vs. Performance
+
+![](img/scan-runtime.png)
+
+
+| Array size (2^x) | CPU (ms) | Naive (ms) | Efficient (ms) | Thrust (ms) |
+|:----------------:| ---:|------:|----------:|-------:|
+| 4 | 0.00033 | 0.0453 | 0.0694 | 1.183 |
+| 8 | 0.00228 | 0.0712 | 0.127 | 1.106 |
+| 12 | 0.00982 | 0.175 | 0.2007 | 2.002 |
+| 16 | 0.166 | 2.091 | 0.595 | 12.76 |
+| 18 | 0.853 | 8.757 | 1.528 | 25.78 |
+| 20 | 2.56 | 37.38 | 5.82 | 78.57 |
+| 21 | 4.96 | 77.58 | 11.43 | 148.22 |
+| 22 | 10.28 | - | - | 289.54 |
+
+
+Overall, the CPU implementation had the best performance, followed by my work-efficient GPU implementation, then the naive GPU implementation, and finally the thrust library call.
+
+The results for work-efficient vs. naive were as expected. On small arrays, the naive implementation was faster. However, once the array size increased from 2^12 to 2^16, the naive scan's runtime jumped by a factor of 20, while the work-efficient scan's runtime increased by only a factor of 3. As the array size kept doubling, both implementations' runtimes grew by roughly the same multiplier, but because of this initial jump the work-efficient scan outperformed the naive scan by a factor of about 7 for array sizes of 2^18 to 2^21. This matches the amount of work each algorithm does: the naive scan performs O(n log n) additions, while the work-efficient scan performs O(n). My GPU implementations failed to run on an array size of 2^22.
+
+The CPU implementation's runtime increased smoothly as the array size increased and did not exhibit any jumps.
+
+The thrust library call was much slower than the other implementations: on large array sizes, it was twice as slow as the naive scan and more than 10x slower than the work-efficient scan. It did scale well with increasing array size, though.
+
+I believe the performance bottlenecks were different for the CPU implementation, my GPU implementations, and thrust. For the CPU, computation time was probably the bottleneck, since every operation had to be performed serially. For the naive and work-efficient scans, I think global memory access was the bottleneck: a GPU's arithmetic throughput is much higher than its global memory bandwidth, and the threads in my kernels are independent of one another and perform only a few arithmetic operations along with their global memory reads and writes. I also kept branching simple (early-out bounds checks and ternary assignments that the hardware can predicate), which should minimize divergence within warps. I was surprised at how slow the thrust library call was. It is presumably using an optimized algorithm under the hood, but it may have been slowed down by layers of other function calls, error checking, etc. It's hard to say exactly why it was slow without examining the source code.
+
+To mitigate the bottlenecks of my GPU implementations, I could use shared memory instead of global memory; a rough sketch of what a shared-memory scan kernel could look like is included below. I'm also interested in whether moving the host-side loops onto the device would increase performance. Finally, I could refactor my work-efficient stream compaction so that there is no memory copying between host and device, which might make it faster than the CPU compaction.
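+
+As a point of reference for the shared-memory idea, here is a minimal, untested sketch of what a single-block shared-memory scan kernel could look like. This is the standard Blelloch formulation, not code from this repository: `kernSharedScan` is a hypothetical name, `n` is assumed to be a power of two small enough for one block to handle (two elements per thread), and shared-memory bank conflicts are ignored.
+
+```cuda
+__global__ void kernSharedScan(int n, int *odata, const int *idata) {
+    extern __shared__ int temp[];       // n ints of dynamic shared memory
+    int tid = threadIdx.x;
+    int offset = 1;
+
+    // Each thread loads two elements from global into shared memory.
+    temp[2 * tid] = idata[2 * tid];
+    temp[2 * tid + 1] = idata[2 * tid + 1];
+
+    // Up-sweep (reduce) phase, entirely in shared memory.
+    for (int d = n >> 1; d > 0; d >>= 1) {
+        __syncthreads();
+        if (tid < d) {
+            int ai = offset * (2 * tid + 1) - 1;
+            int bi = offset * (2 * tid + 2) - 1;
+            temp[bi] += temp[ai];
+        }
+        offset *= 2;
+    }
+
+    // Zero the root, then down-sweep to produce the exclusive scan.
+    if (tid == 0) {
+        temp[n - 1] = 0;
+    }
+    for (int d = 1; d < n; d *= 2) {
+        offset >>= 1;
+        __syncthreads();
+        if (tid < d) {
+            int ai = offset * (2 * tid + 1) - 1;
+            int bi = offset * (2 * tid + 2) - 1;
+            int t = temp[ai];
+            temp[ai] = temp[bi];
+            temp[bi] += t;
+        }
+    }
+    __syncthreads();
+
+    // Write the scanned values back to global memory.
+    odata[2 * tid] = temp[2 * tid];
+    odata[2 * tid + 1] = temp[2 * tid + 1];
+}
+
+// Hypothetical launch: one block of n / 2 threads, n ints of shared memory.
+// kernSharedScan<<<1, n / 2, n * sizeof(int)>>>(n, dev_odata, dev_idata);
+```
+
+Scanning arrays larger than one block this way would take a second pass: scan each block's chunk in shared memory, then scan the per-block sums and add them back in.
+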
+#### Test Program Output
+
+The following is a sample of the test program output from `main.cpp` for an array of size 2^20.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [   1  42  48  47   9  36  13  34  28  39   3   2  38 ...   1   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.4675ms    (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.47652ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 38.5892ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 37.0342ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 6.30717ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 6.17421ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 84.909ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 78.5352ms    (CUDA Measured)
+    passed
+
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   0   0   0   3   1   0   1   3   1   0   0   2   0 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 4.67753ms    (std::chrono Measured)
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 4.78749ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan ====
+   elapsed time: 12.9883ms    (std::chrono Measured)
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 13.6215ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 13.4029ms    (CUDA Measured)
+    passed
+Press any key to continue . . .
+``` diff --git a/img/block-size.png b/img/block-size.png new file mode 100644 index 0000000..9bcdf72 Binary files /dev/null and b/img/block-size.png differ diff --git a/img/scan-runtime.png b/img/scan-runtime.png new file mode 100644 index 0000000..965df77 Binary files /dev/null and b/img/scan-runtime.png differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..c0aa9e0 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,8 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array +//const int SIZE = 8; const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; @@ -29,20 +30,74 @@ int main(int argc, char* argv[]) { a[SIZE - 1] = 0; printArray(SIZE, a, true); - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes. + float cpuTime = 0; + float naiveTime = 0; + float efficientTime = 0; + float thrustTime = 0; + + //for (int i = 0; i < 10; i++) + //{ + // genArray(SIZE - 1, a, 50, i); // Leave a 0 at the end to test that edge case + // a[SIZE - 1] = 0; + // printArray(SIZE, a, true); + + // zeroArray(SIZE, b); + // //printDesc("cpu scan, power-of-two"); + // StreamCompaction::CPU::scan(SIZE, b, a); + // //printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + // //printArray(SIZE, b, true); + + // zeroArray(SIZE, c); + // //printDesc("naive scan, power-of-two"); + // StreamCompaction::Naive::scan(SIZE, c, a); + // //printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // //printArray(SIZE, c, true); + // //printCmpResult(SIZE, b, c); + + // zeroArray(SIZE, c); + // //printDesc("work-efficient scan, power-of-two"); + // StreamCompaction::Efficient::scan(SIZE, c, a); + // //printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // //printArray(SIZE, c, true); + // //printCmpResult(SIZE, b, c); + + // zeroArray(SIZE, c); + // //printDesc("thrust scan, power-of-two"); + // StreamCompaction::Thrust::scan(SIZE, c, a); + // //printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // //printArray(SIZE, c, true); + // //printCmpResult(SIZE, b, c); + + // cpuTime += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(); + // naiveTime += StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(); + // efficientTime += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(); + // thrustTime += StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(); + //} + + //cpuTime /= 10.f; + //naiveTime /= 10.f; + //efficientTime /= 10.f; + //thrustTime /= 10.f; + + //printf("CPU time: %f\n", cpuTime); + //printf("Naive time: %f\n", naiveTime); + //printf("Efficient time: %f\n", efficientTime); + //printf("Thrust time: %f\n", thrustTime); + + //initialize b using StreamCompaction::CPU::scan you implement + //We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. + //At first all cases passed because b && c are all zeroes. 
zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); + //printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); + //printArray(NPOT, b, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); @@ -100,14 +155,14 @@ int main(int argc, char* argv[]) { int count, expectedCount, expectedNPOT; - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. + //initialize b using StreamCompaction::CPU::compactWithoutScan you implement + //We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. zeroArray(SIZE, b); printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); expectedCount = count; - printArray(count, b, true); + //printArray(count, b, true); printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); @@ -115,14 +170,14 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); expectedNPOT = count; - printArray(count, c, true); + //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); zeroArray(SIZE, c); printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); + //printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index ae94ca6..68a2450 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -45,6 +45,7 @@ void zeroArray(int n, int *a) { void genArray(int n, int *a, int maxval) { srand(time(nullptr)); + //srand(seed); for (int i = 0; i < n; i++) { a[i] = rand() % maxval; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..ac28fcb 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,14 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index >= n) + { + return; + } + + bools[index] = idata[index] != 0 ? 
                1 : 0;
        }

        /**
@@ -32,7 +39,17 @@ namespace StreamCompaction {
         */
        __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index >= n)
+            {
+                return;
+            }
+
+            if (bools[index] == 1)
+            {
+                odata[indices[index]] = idata[index];
+            }
        }
    }
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 05ce667..b2f43d0 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -18,9 +18,16 @@ namespace StreamCompaction {
         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
+            //timer().startCpuTimer();
+
+            odata[0] = 0;
+
+            for (int i = 1; i < n; i++)
+            {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
+
+            //timer().endCpuTimer();
        }

        /**
@@ -29,10 +36,21 @@ namespace StreamCompaction {
         * @returns the number of elements remaining after compaction.
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
+            int j = 0;
+
            timer().startCpuTimer();
-            // TODO
+
+            for (int i = 0; i < n; i++)
+            {
+                if (idata[i] != 0)
+                {
+                    odata[j++] = idata[i];
+                }
+            }
+
            timer().endCpuTimer();
-            return -1;
+
+            return j;
        }

        /**
@@ -41,10 +59,37 @@ namespace StreamCompaction {
         * @returns the number of elements remaining after compaction.
         */
        int compactWithScan(int n, int *odata, const int *idata) {
+            int *scanResult = new int[n];
+            int j = 0;
+
            timer().startCpuTimer();
-            // TODO
+
+            for (int i = 0; i < n; i++)
+            {
+                odata[i] = idata[i] == 0 ? 0 : 1;
+            }
+
+            scan(n, scanResult, odata);
+
+            for (int i = 0; i < n - 1; i++)
+            {
+                if (odata[i] == 1)
+                {
+                    odata[scanResult[i]] = idata[i];
+                    j++;
+                }
+            }
+
+            if (odata[n - 1] == 1)
+            {
+                odata[scanResult[n - 1] + 1] = idata[n - 1];
+                j++;
+            }
+
            timer().endCpuTimer();
-            return -1;
+
+            delete[] scanResult;
+            return j;
        }
    }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 36c5ef2..2d870e5 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -4,37 +4,144 @@
#include "efficient.h"
namespace StreamCompaction {
-    namespace Efficient {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
-
-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-        }
-
-        /**
-         * Performs stream compaction on idata, storing the result into odata.
-         * All zeroes are discarded.
-         *
-         * @param n The number of elements in idata.
-         * @param odata The array into which to store elements.
-         * @param idata The array of elements to compact.
-         * @returns The number of elements remaining after compaction.
-         */
-        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
-        }
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
+
+        int *dev_data;
+        int *dev_odata;
+        int *dev_bools;
+        int *dev_indices;
+        int *dev_idata;
+
+        const int BLOCK_SIZE = 32;
+
+        __global__ void kernUpSweep(int n, int width, int *data)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index >= n)
+            {
+                return;
+            }
+
+            data[(index + 1) * width - 1] += data[index * width - 1 + (width / 2)];
+        }
+
+        __global__ void kernDownSweep(int n, int width, int *data)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index >= n)
+            {
+                return;
+            }
+
+            int halfIndex = index * width - 1 + (width / 2);
+            int fullIndex = (index + 1) * width - 1;
+            int oldHalfIndexValue = data[halfIndex];
+
+            data[halfIndex] = data[fullIndex];
+            data[fullIndex] += oldHalfIndexValue;
+        }
+
+        __global__ void kernSetValueToZero(int i, int *data)
+        {
+            data[i] = 0;
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int d, numThreads, numBlocks;
+
+            int width = 1;
+            int nPowerOfTwo = pow(2, ilog2ceil(n));
+            int numIterations = ilog2(nPowerOfTwo) - 1;
+
+            cudaMalloc((void**)&dev_data, nPowerOfTwo * sizeof(int));
+
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            //timer().startGpuTimer();
+
+            for (d = 0; d <= numIterations; d++)
+            {
+                width *= 2;
+                numThreads = nPowerOfTwo / width;
+                numBlocks = (numThreads + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+                kernUpSweep<<<numBlocks, BLOCK_SIZE>>>(numThreads, width, dev_data);
+            }
+
+            kernSetValueToZero<<<1, 1>>>(nPowerOfTwo - 1, dev_data);
+            width = pow(2, numIterations + 2);
+
+            for (d = numIterations; d >= 0; d--)
+            {
+                width /= 2;
+                numThreads = nPowerOfTwo / width;
+                numBlocks = (numThreads + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+                kernDownSweep<<<numBlocks, BLOCK_SIZE>>>(numThreads, width, dev_data);
+            }
+
+            //timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_data);
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to store elements.
+         * @param idata The array of elements to compact.
+         * @returns The number of elements remaining after compaction.
+         */
+        int compact(int n, int *odata, const int *idata) {
+            int size;
+            int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMalloc((void**)&dev_bools, n * sizeof(int));
+            cudaMalloc((void**)&dev_indices, n * sizeof(int));
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+
+            Common::kernMapToBoolean<<<numBlocks, BLOCK_SIZE>>>(n, dev_bools, dev_idata);
+            cudaMemcpy(odata, dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            size = odata[n - 1];
+
+            scan(n, odata, odata);
+
+            size += odata[n - 1];
+
+            cudaMemcpy(dev_indices, odata, n * sizeof(int), cudaMemcpyHostToDevice);
+            Common::kernScatter<<<numBlocks, BLOCK_SIZE>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_bools);
+            cudaFree(dev_indices);
+            cudaFree(dev_odata);
+
+            return size;
+        }
+    }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..6e61ac2 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -4,22 +4,78 @@
#include "naive.h"
namespace StreamCompaction {
-    namespace Naive {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
-        // TODO: __global__
-
-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-        }
+    namespace Naive {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
-}
+
+        int *dev_input;
+        int *dev_output1;
+        int *dev_output2;
+
+        const int BLOCK_SIZE = 64;
+
+        __global__ void kernScan(int n, int offset, int *dev_input, int *dev_output1, int *dev_output2)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index >= n)
+            {
+                return;
+            }
+
+            dev_output2[index] = dev_output1[index] + (index >= offset ? dev_output1[index - offset] : 0);
+        }
+
+        __global__ void kernShift(int n, int *dev_input, int *dev_output)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index >= n)
+            {
+                return;
+            }
+
+            dev_output[index] = index > 0 ? dev_input[index - 1] : 0;
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+            int *temp;
+            int offset;
+
+            cudaMalloc((void**)&dev_input, n * sizeof(int));
+            cudaMalloc((void**)&dev_output1, n * sizeof(int));
+            cudaMalloc((void**)&dev_output2, n * sizeof(int));
+
+            cudaMemcpy(dev_input, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+
+            kernShift<<<numBlocks, BLOCK_SIZE>>>(n, dev_input, dev_output1);
+
+            for (offset = 1; offset <= n; offset *= 2)
+            {
+                kernScan<<<numBlocks, BLOCK_SIZE>>>(n, offset, dev_input, dev_output1, dev_output2);
+
+                temp = dev_output1;
+                dev_output1 = dev_output2;
+                dev_output2 = temp;
+            }
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_output1, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_input);
+            cudaFree(dev_output1);
+            cudaFree(dev_output2);
+        }
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 36b732d..79536d6 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -7,22 +7,37 @@
#include "thrust.h"
namespace StreamCompaction {
-    namespace Thrust {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-            timer().endGpuTimer();
-        }
+    namespace Thrust {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
+
+        int *dev_idata;
+        int *dev_odata;
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            thrust::device_ptr<int> dev_thrust_idata(dev_idata);
+            thrust::device_ptr<int> dev_thrust_odata(dev_odata);
+
+            timer().startGpuTimer();
+            thrust::exclusive_scan(dev_thrust_idata, dev_thrust_idata + n, dev_thrust_odata);
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+        }
+    }
}
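A closing note on the "no host/device copying" idea from the performance analysis: `Efficient::compact` above copies the boolean flags back to the host and reuses `scan`, which itself round-trips through host memory. The sketch below is hypothetical and untested, not part of this diff. `compactDeviceResident` and `scanDevice` are made-up names; `scanDevice(int n, int *dev_data)` is assumed to run the same `kernUpSweep` / `kernSetValueToZero` / `kernDownSweep` sequence as `scan`, but in place on a buffer that already lives on the device (including the internal padding to a power-of-two length).

```cuda
// Hypothetical device-resident variant of StreamCompaction::Efficient::compact.
// scanDevice() is an assumed helper (not in this diff) that performs the same
// exclusive scan as scan(), but in place on a device buffer.
int compactDeviceResident(int n, int *odata, const int *idata) {
    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    int lastBool = 0, lastIndex = 0;

    cudaMalloc((void**)&dev_idata, n * sizeof(int));
    cudaMalloc((void**)&dev_bools, n * sizeof(int));
    cudaMalloc((void**)&dev_indices, n * sizeof(int));
    cudaMalloc((void**)&dev_odata, n * sizeof(int));
    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    // Map to booleans and scan them into scatter indices without a host round trip.
    Common::kernMapToBoolean<<<numBlocks, BLOCK_SIZE>>>(n, dev_bools, dev_idata);
    cudaMemcpy(dev_indices, dev_bools, n * sizeof(int), cudaMemcpyDeviceToDevice);
    scanDevice(n, dev_indices);   // assumed: exclusive scan, in place, on the device

    Common::kernScatter<<<numBlocks, BLOCK_SIZE>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);

    // Only the last flag, the last index, and the compacted output come back to the host.
    cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    int size = lastIndex + lastBool;
    cudaMemcpy(odata, dev_odata, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_idata);
    cudaFree(dev_bools);
    cudaFree(dev_indices);
    cudaFree(dev_odata);
    return size;
}
```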