Binary file added 1-CUDA-Introduction-1.pptx
Binary file not shown.
Binary file added 2-GPU-Architecture-Overview.pptx
Binary file not shown.
Binary file added 3-Parallel-Algorithms-1.pptx
Binary file not shown.
Binary file added BugPNG.PNG
Binary file added Profiling.pdf
Binary file not shown.
Binary file added Profiling.xlsx
Binary file not shown.
Binary file added Profiling_Page_1.png
Binary file added Profiling_Page_2.png
Binary file added Profiling_Page_3.png
Binary file added Profiling_Page_4.png
Binary file added Profiling_Page_5.png
67 changes: 62 additions & 5 deletions README.md
@@ -3,11 +3,68 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Ethan Brooks
* Tested on: Windows 7, Intel(R) Xeon(R), GeForce GTX 1070 8GB (SIG Lab)

### (TODO: Your README)
For this project, we compare several implementations of two fundamental GPU algorithms:

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
- `scan`: related to `fold` in functional programming. `scan` takes an input array and a binary operator and returns an array in which each element is the reduction, under that operator, of all the elements that precede it in the input array. Since the binary operator is addition in our implementation, each element of the output array is the sum of all preceding elements of the input array; for example, an exclusive scan with addition maps `[3, 1, 7, 0, 4]` to `[0, 3, 4, 11, 11]`.

- `stream-compact`: equivalent to `filter` in functional programming. `stream-compact` takes an input array and a test function and returns a shortened version of the input array containing only the elements that pass the test. Our implementations use an implicit test that passes only if a number is nonzero; in other words, our function filters all zeros out of the input array, so that, for example, `[3, 0, 1, 0, 2]` compacts to `[3, 1, 2]`.

## Scan
We implemented three versions of `scan`:

- CPU scan: this implementation does not use the GPU at all. Instead, it simply iterates through the input array on the CPU, accumulating sums and writing them to the output array as it goes.

- Naive scan: this implementation iteratively adds elements at doubling strides: on the first iteration, the algorithm adds all pairs of directly adjacent elements; on the second iteration, it adds elements separated by a stride of two; on the third, elements at a stride of four. Strides continue to double until only one addition is performed. The following picture depicts this process:
![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/parallel%20scan.png)
Since the process terminates when the doubling stride exceeds the length of the input array, the total number of iterations is O(log n), where n is the number of elements in the input array. Since each thread performs a single addition per iteration, the time complexity of the algorithm as a whole is O(log n), _assuming that there are O(n) threads_.
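
The kernel below is a minimal sketch of one such iteration (the names are hypothetical; this project's actual kernel differs in its details). It assumes ping-pong buffers `in` and `out` so that reads and writes never race, and it is launched once per stride with at least `n` threads:

```cuda
// One iteration of naive (Hillis-Steele) scan at a given stride.
__global__ void kernNaiveScanStep(int n, int stride, int *out, const int *in) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= n) return;

    if (i >= stride) {
        out[i] = in[i - stride] + in[i];  // add the element one stride back
    } else {
        out[i] = in[i];                   // nothing to add yet; copy through
    }
}
```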

- Efficient scan: naive scan is perfectly effective when there are as many hardware threads as elements in the input array. For larger arrays, however, threads must take turns on the available cores, so it pays to minimize the total work launched. Naive scan actually performs _O(n log n)_ addition operations and consequently launches a total of _O(n log n)_ threads. Since the CPU version performs only _n_ additions, this suggests that we can do better. Efficient scan uses a clever upsweep/downsweep approach that keeps the logarithmic number of iterations of the naive version while performing only _O(n)_ additions in total, as sketched below.
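
The two kernels below sketch the upsweep/downsweep idea (again with hypothetical names; the real implementation may differ). The array length `n` is assumed padded to a power of two, each step launches `n / (2 * stride)` threads, and `data[n - 1]` is zeroed between the two sweeps to make the scan exclusive:

```cuda
// Upsweep: build a reduction tree of partial sums in place.
// Called with stride = 1, 2, 4, ..., n/2.
__global__ void kernUpSweep(int n, int stride, int *data) {
    int t = (blockIdx.x * blockDim.x) + threadIdx.x;
    int i = (t + 1) * 2 * stride - 1;   // right child of this thread's node
    if (i >= n) return;

    data[i] += data[i - stride];
}

// Downsweep: walk back down the tree, distributing partial sums.
// Called with stride = n/2, n/4, ..., 1, after zeroing data[n - 1].
__global__ void kernDownSweep(int n, int stride, int *data) {
    int t = (blockIdx.x * blockDim.x) + threadIdx.x;
    int i = (t + 1) * 2 * stride - 1;
    if (i >= n) return;

    int left = data[i - stride];
    data[i - stride] = data[i];         // pass the parent's value to the left
    data[i] += left;                    // right child gets parent + left sum
}
```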

- Thrust scan: this implementation simply calls into Thrust, a C++ library of parallel GPU algorithms.
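
For reference, a self-contained Thrust version looks roughly like this (a sketch; our actual wrapper may differ slightly):

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Exclusive prefix sum of idata into odata, both host arrays of length n.
void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);  // copy to device
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);   // copy back to host
}
```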

The following diagram compares performance between these implementations:
![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/Profiling_Page_2.png)

Note that the x-axis in this diagram is logarithmic (that is, at each tick along the x-axis, the number of elements doubles).

There are a few curiosities about these results. First, efficient scan, naive scan, and Thrust scan all appear to run in linear, not logarithmic, time. This is because the GeForce GTX 1070 has only 1920 CUDA cores: the number of threads that can execute simultaneously is fixed by the hardware and does not grow with the size of the input. On GPUs with more cores, parallelism can still yield substantial speedups by dividing the total execution time by a large constant factor. But as the number of elements grows relative to the number of concurrently executing threads, this improvement becomes less evident. In fact, the constant-factor improvement may be offset by the GPU's shortcomings, namely its lack of optimizations for sequential execution (e.g. pipelining).

This explains why the performance of the GPU implementations is comparable to (and in many cases worse than) that of the CPU implementation. It still does not explain why naive scan outperforms all the other GPU algorithms while efficient scan performs worst of all. One possible reason is that although efficient scan reduces the total number of additions by a factor of O(log n), it splits the work into an upsweep and a downsweep, the latter of which must wait for the former to complete. Near the top of the reduction tree -- the end of the upsweep and the start of the downsweep -- very few cores are active; at the peak, only one core is active, since only one addition is performed there. The naive implementation suffers from this as well, but it encounters the situation once, whereas the efficient implementation encounters it twice. Consequently, the naive version achieves higher hardware saturation, even though it does more work overall.

Another likely explanation is that the efficient implementation performs almost twice as many memory accesses per kernel invocation: three in the upsweep kernel and four in the downsweep, whereas the naive implementation performs only two per invocation. Since the GPU can perform only a limited number of simultaneous memory accesses, this may be a bottleneck for the efficient implementation.

## Stream Compaction
We also implemented three versions of `stream-compact`.

- efficient stream-compact: this version takes three steps (a host-side sketch of the whole pipeline follows the list below):

1. We call a kernel that populates a new array of booleans indicating whether each element passes our test function (in our case, whether it is nonzero).
2. We perform `scan` on the array of booleans. We use the efficient version of `scan` described in the previous section.
3. The output of `scan` gives, for each surviving element, the index in the output array to which it should be written. We use these indices to scatter elements of the input array into the output array in parallel.

- compact with scan: to mirror the GPU version, we implemented a CPU version that uses the CPU scan algorithm. As the diagram below demonstrates, this yields no performance benefit on the CPU.

- compact without scan: this final implementation is a straightforward CPU loop that simply copies each nonzero element of the input array to the output array.
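
A host-side sketch of the efficient pipeline (simplified: no error checking, `n` assumed a power of two for the scan, and `deviceScan` is a hypothetical wrapper around the efficient scan described above):

```cuda
#include <cuda_runtime.h>
#include <stream_compaction/common.h>

// Compacts dev_in (device memory, length n) into dev_out; returns the
// number of surviving elements.
int compact(int n, int *dev_out, const int *dev_in) {
    const int blockSize = 256;
    const int numBlocks = getNumBlocks(blockSize, n);

    int *dev_bools, *dev_indices;
    cudaMalloc(&dev_bools, n * sizeof(int));
    cudaMalloc(&dev_indices, n * sizeof(int));

    // 1. Map each element to 1 if it passes the test (nonzero), else 0.
    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, blockSize>>>(
        n, dev_bools, dev_in);

    // 2. Exclusive-scan the booleans; indices[i] is element i's output slot.
    deviceScan(n, dev_indices, dev_bools);

    // 3. Scatter each passing element to its slot, in parallel.
    StreamCompaction::Common::kernScatter<<<numBlocks, blockSize>>>(
        n, dev_out, dev_in, dev_bools, dev_indices);

    // Count = last index + last boolean (an exclusive scan omits the final element).
    int lastIndex, lastBool;
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_bools);
    cudaFree(dev_indices);
    return lastIndex + lastBool;
}
```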

The following diagram compares performance across these implementations. Again, the x-axis is logarithmic.

![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/Profiling_Page_3.png)
In this case, we observe that the GPU implementation does outperform the CPU implementations. However, as discussed in the previous section, this cannot be credited to the `scan` operation; it must instead come from steps 1 and 3 -- the parallel testing of each element and the parallel scattering of survivors into the output array.

## Large numbers bug
A strange bug that remains unresolved in this project is that our GPU algorithms simply would not run for arrays larger than 2^16 elements. The following picture shows the result of running on an array with 2^17 elements:
![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/BugPNG.PNG)
This appears to be a compiler error, since the program encounters no errors on any of the smaller arrays we tested. One unfortunate consequence of this issue is that we could not compare our GPU implementations with the CPU algorithms on very large arrays, where the GPU might indeed have had the advantage. We were, however, able to compare the Thrust implementation of scan with the CPU version at sizes as large as 2^29. The following graph depicts the results:
![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/Profiling_Page_1.png)
It is interesting to note that the CPU _still_ outperforms even the optimized Thrust implementation. Also, the Thrust implementation triggers the same bug described earlier once arrays grow beyond 2^29 elements. Again, we attribute the GPU's poor performance to the limited parallel throughput of the GeForce GTX 1070 relative to these array sizes.

## blockSize optimizations
On both the naive and efficient implementations, we experimented with different block sizes and recorded the results in the charts below:

![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/Profiling_Page_4.png)
![alt text](https://github.com/lobachevzky/Project2-Stream-Compaction/blob/master/Profiling_Page_5.png)

Based on these experiments, we used a block size of 256 for all of the earlier measurements.
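
For reference, the chosen block size enters each kernel launch roughly like this (`kernSomething` and `launch` are placeholders for illustration; `getNumBlocks` is the helper from `common.cu`):

```cuda
// Hypothetical kernel, standing in for any of the kernels above.
__global__ void kernSomething(int n, int *data) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= n) return;
    data[i] *= 2;  // placeholder work
}

void launch(int n, int *dev_data) {
    const int blockSize = 256;                   // best performer in our tests
    int numBlocks = getNumBlocks(blockSize, n);  // ceil(n / blockSize) blocks
    kernSomething<<<numBlocks, blockSize>>>(n, dev_data);
}
```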
Binary file added parallel scan.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 2 additions & 1 deletion src/main.cpp
@@ -7,14 +7,15 @@
*/

#include <cstdio>
#include <cuda.h>
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

int main(int argc, char* argv[]) {
const int SIZE = 1 << 8;
const int SIZE = 1 << 15;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];

87 changes: 58 additions & 29 deletions stream_compaction/common.cu
@@ -1,39 +1,68 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}

fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
int getNumBlocks(int blockSize, int n) {
    return (n + blockSize - 1) / blockSize;
}


namespace StreamCompaction {
namespace Common {

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
void printArray(int n, const int *a, bool abridged) {
    printf(" [ ");
    for (int i = 0; i < n; i++) {
        if (abridged && i + 2 == 15 && n > 16) {
            i = n - 2;
            printf("... ");
        }
        printf("%3d ", a[i]);
    }
    printf("]\n");
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess == err) {
        return;
    }

    fprintf(stderr, "CUDA error");
    if (file) {
        fprintf(stderr, " (%s:%d)", file, line);
    }
    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

__device__ int threadIndex() {
    return (blockIdx.x * blockDim.x) + threadIdx.x;
}


namespace StreamCompaction {
namespace Common {

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    int index = threadIndex();
    if (index >= n) return;

    // 1 if this element should be kept (nonzero), 0 otherwise.
    bools[index] = idata[index] != 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
        const int *idata, const int *bools, const int *indices) {
    int index = threadIndex();
    if (index >= n) return;

    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
}
32 changes: 19 additions & 13 deletions stream_compaction/common.h
@@ -7,29 +7,35 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

void printArray(int n, const int *a, bool abridged = false);

int getNumBlocks(int blockSize, int n);

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return ilog2(x - 1) + 1;
return ilog2(x - 1) + 1;
}


__device__ int threadIndex();
namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
}
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);

}
}
85 changes: 57 additions & 28 deletions stream_compaction/cpu.cu
@@ -1,36 +1,65 @@
#include <cstdio>
#include "cpu.h"
#include "common.h"
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

namespace StreamCompaction {
namespace CPU {

/**
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
// TODO
printf("TODO\n");
}
namespace CPU {

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
// TODO
return -1;
}
/**
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
    // Exclusive scan: odata[i] holds the sum of idata[0 .. i-1].
    int valToWrite = 0;
    int sumOfPrev2;
    for (int i = 0; i < n; i++) {
        sumOfPrev2 = valToWrite + idata[i];
        odata[i] = valToWrite;
        valToWrite = sumOfPrev2;
    }
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
// TODO
return -1;
}
/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
    // Copy each nonzero element forward; j tracks the compacted length.
    int j = 0;
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) {
            odata[j++] = idata[i];
        }
    }
    return j;
}

}
/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
    // Map: odata[i] = 1 if idata[i] survives compaction, else 0.
    for (int i = 0; i < n; i++) {
        odata[i] = idata[i] != 0;
    }

    // Exclusive scan of the booleans, in place.
    scan(n, odata, odata);

    // Compacted length = last scan value + last boolean.
    int retVal = odata[n - 1] + (idata[n - 1] != 0);

    // Scatter in place: odata[i] <= i, so each write lands at or before the
    // read position and never clobbers an entry that has yet to be read.
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) {
            odata[odata[i]] = idata[i];
        }
    }
    return retVal;
}
}
}