Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
106 changes: 101 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,107 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Mariano Merchante
* Tested on
* Microsoft Windows 10 Pro
* Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 2601 Mhz, 4 Core(s), 8 Logical Processor(s)
* 32.0 GB RAM
* NVIDIA GeForce GTX 1070 (mobile version)

### (TODO: Your README)
## Details

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project implements and compares different algorithms for scan and stream compaction, both in GPU and CPU. In particular, it implements a serial version, a naive parallel version, a work efficient version and an implementation from the Thrust library. A simple CSV exporter for automated performance evaluation is also included.

## Note about Work Efficient implementation

My implementation of the Work-Efficient approach tries to reduce the amount of threads dispatched to do both the up and down sweep, and thus it performs better than the naive approach. This happens because occupancy is maximized and only threads that do useful work are running.

## Analysis

It is interesting to analyze the results both in debug and release mode, as some behaviours arise. In general, CPU < Naive < Work-Efficient, but the Thrust library implementation radically changes between both.

![](images/debug_pot.png)
![](images/release_pot.png)
![](images/debug_vs_release.png)

The relative performances seem to be uniformly correlated, exept for Thrust. I suspect this happens because the library does some precomputation and thus becomes faster after some iterations, as the next graph exemplifies. It is also probably faster than my work efficient implementation, as I didn't do any shared memory optimizations. A clear drop in execution time can be seen after the first Thrust run.

![](images/iterations.png)

An example of running the algorithms with an array of the same size. Note how Thrust drops and performs better with time.

It is also useful to analyze the relationship between using power of two arrays or non power of two. Note that the Thrust pattern still emerges. Although the ratio is close to 1, the non power of two case definitely uses more memory than necessary and is going to impact the overall application.

![](images/debug_npot.png)
![](images/release_npot.png)
![](images/pot_vs_npot.png)


An example run can be seen here:
```
SIZE: 4194304
****************
** SCAN TESTS **
****************
[ 35 16 14 26 19 39 4 39 24 42 27 45 46 ... 11 0 ]
==== cpu scan, results test ====
==== PASS: YES ====
[ 0 1 6 6 7 9 9 ]
[ 0 1 6 6 7 9 9 ]
==== cpu scan, power-of-two ====
elapsed time: 8.31961ms (std::chrono Measured)
[ 0 35 51 65 91 110 149 153 192 216 258 285 330 ... 102731641 102731652 ]
==== cpu scan, non-power-of-two ====
elapsed time: 8.75102ms (std::chrono Measured)
[ 0 35 51 65 91 110 149 153 192 216 258 285 330 ... 102731502 102731543 ]
passed
==== naive scan, power-of-two ====
elapsed time: 4.20352ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 4.20454ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 2.05008ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.88826ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 7.67795ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.47558ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 3 0 1 1 3 2 2 0 1 3 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
==== PASS: YES ====
[ 1 5 1 2 3 ]
[ 1 5 1 2 3 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 7.46312ms (std::chrono Measured)
[ 3 3 1 1 3 2 2 1 3 3 2 3 2 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 4.41561ms (std::chrono Measured)
[ 3 3 1 1 3 2 2 1 3 3 2 3 2 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 25.8876ms (std::chrono Measured)
[ 3 3 1 1 3 2 2 1 3 3 2 3 2 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.75443ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 2.47562ms (CUDA Measured)
passed
Press any key to continue . . .



```
Binary file added images/debug_npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/debug_pot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/debug_vs_release.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/iterations.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/pot_vs_npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/release_npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/release_pot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
107 changes: 104 additions & 3 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,20 +6,32 @@
* @copyright University of Pennsylvania
*/

#include <thread>
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdio>
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 5; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];

int testInput[] = { 1, 5, 0, 1, 2, 0, 3 };
int testOutput[] = { 0, 1, 6, 6, 7, 9, 9 };

int testCompactionInput[] = { 1, 5, 0, 1, 2, 0, 3 };
int testCompactionOutput[] = { 1, 5, 1, 2, 3 };

int main(int argc, char* argv[]) {
// Scan tests

printf("SIZE: %d", SIZE);

// Scan tests
printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
Expand All @@ -29,6 +41,18 @@ int main(int argc, char* argv[]) {
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

// Test example
zeroArray(7, b);
printDesc("cpu scan, results test");
StreamCompaction::CPU::scan(7, b, testInput);
bool pass = true;
for (int i = 0; i < 7; ++i)
if (testOutput[i] != b[i])
pass = false;
printDesc((std::string("PASS: ") + (pass ? "YES": "NO")).c_str());
printArray(7, testOutput, true);
printArray(7, b, true);

// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.
Expand Down Expand Up @@ -100,6 +124,16 @@ int main(int argc, char* argv[]) {

int count, expectedCount, expectedNPOT;

// Test results
zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(7, b, testCompactionInput);
pass = cmpArrays(5, testCompactionOutput, b) == 0;
printDesc((std::string("PASS: ") + (pass ? "YES" : "NO")).c_str());
expectedCount = count;
printArray(count, b, true);
printArray(5, testCompactionOutput, true);

// initialize b using StreamCompaction::CPU::compactWithoutScan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
zeroArray(SIZE, b);
Expand Down Expand Up @@ -139,5 +173,72 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

bool generateCSV = false;

if (generateCSV)
{
bool useNPOT = true;
int steps = 22;
std::vector<std::vector<float>> timeData;

for (int i = 1; i < steps + 1; ++i)
{
int size = (1 << i);

if (useNPOT)
size = (size - 3 > 0) ? size - 3 : size;

int * data = new int[size];
int * result = new int[size];
genArray(size, data, i * 5);
zeroArray(size, data);

std::vector<float> stepData;
stepData.push_back(size);

// Run each implementation -- we don't care about the results (the previous tests cover that)
{
zeroArray(size, result);
StreamCompaction::CPU::scan(size, result, data);
stepData.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());

zeroArray(size, result);
StreamCompaction::Naive::scan(size, result, data);
stepData.push_back(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation());

zeroArray(size, result);
StreamCompaction::Efficient::scan(size, result, data);
stepData.push_back(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation());

zeroArray(size, result);
StreamCompaction::Thrust::scan(size, result, data);
stepData.push_back(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation());
}

timeData.push_back(stepData);

delete[] data;
delete[] result;
}

std::ofstream fstr;
fstr.open("data.csv", std::ofstream::out);

for (int i = 0; i < timeData.size(); ++i)
{
std::string line = "";

for (int j = 0; j < timeData[i].size(); ++j)
line += std::to_string(timeData[i][j]) + ", "; // Parsers remove this

line += "\n";
std::cout << line << std::endl;
fstr.write(line.c_str(), line.length());
}

fstr.close();
}


system("pause"); // stop Win32 console from closing on exit
}
10 changes: 9 additions & 1 deletion src/testing_helpers.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -44,13 +44,21 @@ void zeroArray(int n, int *a) {
}

void genArray(int n, int *a, int maxval) {
srand(time(nullptr));
srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = rand() % maxval;
}
}

void genArrayNonZero(int n, int *a, int maxval) {
srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = (rand() % maxval) + 1;
}
}

void printArray(int n, int *a, bool abridged = false) {
printf(" [ ");
for (int i = 0; i < n; i++) {
Expand Down
26 changes: 21 additions & 5 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#include "common.h"
#include <device_launch_parameters.h>

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
Expand All @@ -22,17 +23,32 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

bools[index] = (idata[index] != 0 ? 1 : 0);
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
__global__ void kernScatter(int n, int *odata, const int *idata, const int *indices)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

if (idata[index] != 0)
{
int b = indices[index]; // For some reason, odata[indices[index]] was having race conditions when indices matched
odata[b] = idata[index];
}
}

}
Expand Down
3 changes: 1 addition & 2 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,7 @@ namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
__global__ void kernScatter(int n, int *odata, const int *idata, const int *indices);

/**
* This class is used for timing the performance
Expand Down
Loading