diff --git a/CPUTiming.png b/CPUTiming.png new file mode 100644 index 0000000..8870aa3 Binary files /dev/null and b/CPUTiming.png differ diff --git a/NaiveScanTiming.png b/NaiveScanTiming.png new file mode 100644 index 0000000..286990e Binary files /dev/null and b/NaiveScanTiming.png differ diff --git a/README.md b/README.md index 6e02afa..2c91913 100644 --- a/README.md +++ b/README.md @@ -3,131 +3,31 @@ Project-2 A Study in Parallel Algorithms : Stream Compaction -# INTRODUCTION -Many of the algorithms you have learned thus far in your career have typically -been developed from a serial standpoint. When it comes to GPUs, we are mainly -looking at massively parallel work. Thus, it is necessary to reorient our -thinking. In this project, we will be implementing a couple different versions -of prefix sum. We will start with a simple single thread serial CPU version, -and then move to a naive GPU version. Each part of this homework is meant to -follow the logic of the previous parts, so please do not do this homework out of -order. - -This project will serve as a stream compaction library that you may use (and -will want to use) in your -future projects. For that reason, we suggest you create proper header and CUDA -files so that you can reuse this code later. You may want to create a separate -cpp file that contains your main function so that you can test the code you -write. - -# OVERVIEW -Stream compaction is broken down into two parts: (1) scan, and (2) scatter. - -## SCAN -Scan or prefix sum is the summation of the elements in an array such that the -resulting array is the summation of the terms before it. Prefix sum can either -be inclusive, meaning the current term is a summation of all the elements before -it and itself, or exclusive, meaning the current term is a summation of all -elements before it excluding itself. - -Inclusive: - -In : [ 3 4 6 7 9 10 ] - -Out : [ 3 7 13 20 29 39 ] - -Exclusive - -In : [ 3 4 6 7 9 10 ] - -Out : [ 0 3 7 13 20 29 ] - -Note that the resulting prefix sum will always be n + 1 elements if the input -array is of length n. Similarly, the first element of the exclusive prefix sum -will always be 0. In the following sections, all references to prefix sum will -be to the exclusive version of prefix sum. - -## SCATTER -The scatter section of stream compaction takes the results of the previous scan -in order to reorder the elements to form a compact array. - -For example, let's say we have the following array: -[ 0 0 3 4 0 6 6 7 0 1 ] - -We would only like to consider the non-zero elements in this zero, so we would -like to compact it into the following array: -[ 3 4 6 6 7 1 ] - -We can perform a transform on input array to transform it into a boolean array: - -In : [ 0 0 3 4 0 6 6 7 0 1 ] - -Out : [ 0 0 1 1 0 1 1 1 0 1 ] - -Performing a scan on the output, we get the following array : - -In : [ 0 0 1 1 0 1 1 1 0 1 ] - -Out : [ 0 0 0 1 2 2 3 4 5 5 ] - -Notice that the output array produces a corresponding index array that we can -use to create the resulting array for stream compaction. - -# PART 1 : REVIEW OF PREFIX SUM -Given the definition of exclusive prefix sum, please write a serial CPU version -of prefix sum. You may write this in the cpp file to separate this from the -CUDA code you will be writing in your .cu file. - -# PART 2 : NAIVE PREFIX SUM -We will now parallelize this the previous section's code. Recall from lecture -that we can parallelize this using a series of kernel calls. In this portion, -you are NOT allowed to use shared memory. 
-
-### Questions
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory. In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
-
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
-
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum. Then create a GPU version of scatter. Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you. Finally, write a version using thrust.
-
-### Questions
-* Compare your version of stream compact to your version using thrust. How do
-  they compare? How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts. Information on this can be found in the GPU Gems
-chapter listed in the references.
-
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file. In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project. Like other projects, please open a pull request and email Harmony.
-
-# REFERENCES
-"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
+What I have Done:
+  1.) Scan
+      - CPU
+      - Naive GPU
+      - One Block GPU
+      - N Block GPU
+  2.) Scatter
+      - CPU
+      - GPU (w/ One Block)
+  3.) Compact
+      - CPU
+      - GPU (w/ One Block)
+
+What I have Started:
+  - Compaction using Thrust. I put this aside after running into trouble with
+    device/host vectors and repeatedly crashing my system. The Thrust
+    documentation is fairly confusing, and the compiler wasn't catching enough
+    of my mistakes to help me make progress. (A minimal sketch of what I was
+    attempting appears at the end of this write-up.)
+  - N Block Scan optimizations. I suspect there is a better approach than my
+    brute-force one, which doesn't seem to exploit parallelism very well. (A
+    sketch of the standard block-sums decomposition also appears at the end of
+    this write-up.)
+
+Known Bugs:
+  - Naive Scan and N Block Scan consistently come up short by one depth
+    iteration. I can't find the source of this, since the depth calculation is
+    the same as in One Block; it may be imprecision in the floating-point
+    log/pow operations used on the GPU.
+  - The last element of Compact can be hit or miss. I use exclusive scans for
+    all of the intermediary steps, so the last element's own contribution is
+    never counted. (A sketch of the usual fix appears at the end of this
+    write-up.)
+
+What I have learned:
+  - Start earlier. Even though Patrick projected this to take 3-5 hours,
+    putting it off until the weekend wasn't wise. I spent upwards of 15 hours
+    on it and didn't get a chance to do much analysis or optimization.
+  - Templates + CUDA can be rough. I tried templating everything so it could
+    be used with arrays of different data types. This often produced strange,
+    unhelpful compiler errors that not even the internet knew anything about.
+
+Analysis:
+  - CPU Performance - I've uploaded a screenshot of console output showing CPU
+    prefix sum timings for a wide variety of array lengths. The runtime appears
+    to be O(1) for these calculations, at least at the resolution of the timing
+    measurement. More analysis could show the point where this actually breaks
+    down. Note: all of these runs use arrays of floats, not ints.
+  - Naive Scan Performance - I've uploaded a screenshot of console output
+    showing GPU scan timings for the same array lengths as the CPU evaluation.
+    In all cases the GPU is at least two orders of magnitude slower than the
+    CPU implementation, though I didn't average the times over 100+ runs, which
+    would let the GPU start to catch up. At lengths of 100,000 and up, the
+    timing appears to grow as O(n), once enough time is lost going back and
+    forth to global memory.
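+
+Thrust compaction sketch: a minimal sketch of the copy_if-based compaction I
+was attempting, assuming a fixed threshold of 5 (matching the isGreaterThan5
+functor in thrust_compact.cu). thrustCompactSketch is an illustrative name,
+not working code from this project.
+
+```cpp
+#include <thrust/device_vector.h>
+#include <thrust/host_vector.h>
+#include <thrust/copy.h>
+
+// Fixed-threshold predicate, mirroring isGreaterThan5 in thrust_compact.cu.
+struct isGreaterThan5 {
+    __host__ __device__ bool operator() (const int in) const {
+        return (in > 5);
+    }
+};
+
+// Hypothetical stand-in for thrustCompactI; returns the compacted length.
+int thrustCompactSketch(const int* in, int* out, int size) {
+    // Constructing a device_vector from raw host pointers does the
+    // host->device transfer.
+    thrust::device_vector<int> d_in(in, in + size);
+    thrust::device_vector<int> d_out(size);
+
+    // copy_if writes only the elements that pass the predicate and returns
+    // an iterator one past the last element written.
+    thrust::device_vector<int>::iterator d_end =
+        thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), isGreaterThan5());
+
+    // Bring the compacted range back to the host.
+    int new_length = (int) (d_end - d_out.begin());
+    thrust::host_vector<int> h_out(d_out.begin(), d_end);
+    for (int i = 0; i < new_length; i++) {
+        out[i] = h_out[i];
+    }
+    return new_length;
+}
+```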
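+
+N Block scan sketch: a CPU model of the standard block-sums decomposition from
+the GPU Gems scan chapter, which avoids my cross-block ping-pong passes; the
+function and variable names are illustrative. On the GPU, steps 1 and 3 are one
+kernel launch each, and step 2 recurses until the totals fit in a single block.
+
+```cpp
+#include <stdlib.h>
+
+void hierarchicalScan(const int* in, int* out, int n, int tile) {
+    int n_tiles = (n + tile - 1) / tile;
+    int* sums = (int*) malloc(n_tiles*sizeof(int));
+
+    // 1. Independent exclusive scan of each tile, recording each tile's total.
+    for (int b = 0; b < n_tiles; b++) {
+        int sum = 0;
+        for (int i = b*tile; i < (b+1)*tile && i < n; i++) {
+            out[i] = sum;
+            sum += in[i];
+        }
+        sums[b] = sum;
+    }
+
+    // 2. Exclusive scan of the tile totals.
+    int running = 0;
+    for (int b = 0; b < n_tiles; b++) {
+        int s = sums[b];
+        sums[b] = running;
+        running += s;
+    }
+
+    // 3. Add each tile's scanned total to every element of that tile.
+    for (int b = 0; b < n_tiles; b++) {
+        for (int i = b*tile; i < (b+1)*tile && i < n; i++) {
+            out[i] += sums[b];
+        }
+    }
+
+    free(sums);
+}
+```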
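+
+Compact last-element sketch: with an exclusive scan, scan[n-1] only counts the
+kept elements *before* the last slot, so the last element's own mask bit has to
+be added back in. mask and scan are illustrative names for the intermediate
+arrays, assuming they are available on the host.
+
+```cpp
+// The compacted length is the exclusive-scan total plus the last mask bit.
+int compactLength(const int* mask, const int* scan, int n) {
+    return scan[n-1] + mask[n-1];
+}
+
+// The last input element lands at index scan[n-1] iff mask[n-1] == 1.
+void writeLastElement(const int* in, const int* mask, const int* scan, int n, int* out) {
+    if (mask[n-1] == 1) {
+        out[scan[n-1]] = in[n-1];
+    }
+}
+```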
diff --git a/StreamCompaction/StreamCompaction.sln b/StreamCompaction/StreamCompaction.sln new file mode 100644 index 0000000..fb31553 --- /dev/null +++ b/StreamCompaction/StreamCompaction.sln @@ -0,0 +1,20 @@ + +Microsoft Visual Studio Solution File, Format Version 11.00 +# Visual Studio 2010 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "StreamCompaction", "StreamCompaction\StreamCompaction.vcxproj", "{9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|Win32 = Debug|Win32 + Release|Win32 = Release|Win32 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA}.Debug|Win32.ActiveCfg = Debug|Win32 + {9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA}.Debug|Win32.Build.0 = Debug|Win32 + {9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA}.Release|Win32.ActiveCfg = Release|Win32 + {9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA}.Release|Win32.Build.0 = Release|Win32 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection +EndGlobal diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj new file mode 100644 index 0000000..ce98dd3 --- /dev/null +++ b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj @@ -0,0 +1,84 @@ + + + + + Debug + Win32 + + + Release + Win32 + + + + {9B4AE667-EC32-45C1-8F74-FD42DAF1D7EA} + StreamCompaction + + + + Application + true + MultiByte + + + Application + false + true + MultiByte + + + + + + + + + + + + + + + + Level3 + Disabled + + + true + %(AdditionalLibraryDirectories) + cudart.lib;%(AdditionalDependencies) + + + + + Level3 + MaxSpeed + true + true + + + true + true + true + + + + + Document + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters new file mode 100644 index 0000000..3808e28 --- /dev/null +++ b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters @@ -0,0 +1,44 @@ + + + + + {4FC737F1-C7A5-4376-A066-2A32D752A2FF} + cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx + + + {93995380-89BD-4b04-88EB-625FBE52EBFB} + h;hpp;hxx;hm;inl;inc;xsd + + + {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} + rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms + + + + + Source Files + 
+ + Source Files + + + Source Files + + + + + Source Files + + + + + Header Files + + + Header Files + + + Header Files + + + \ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/main.cpp b/StreamCompaction/StreamCompaction/main.cpp
new file mode 100644
index 0000000..c470bef
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/main.cpp
@@ -0,0 +1,260 @@
+
+// C Dependencies
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+#include <math.h>
+
+// Project Dependencies
+#include "prefix_sum.h"
+#include "timing_utils.h"
+#include "thrust_compact.h"
+
+//Global data for run settings
+int LENGTH = 100;
+const int MAX_LENGTH = 100;
+const int MAX = 10;
+const bool TIMING = true;
+const bool PRINT_DATA = true;
+
+template <typename T>
+void cpuPrefixSum(const T* in, T* out, int size) {
+    T sum = static_cast<T>(0.0f);
+    for (int i = 0; i < size; i++) {
+        out[i] = sum;
+        sum += in[i];
+    }
+}
+
+template <typename T>
+void cpuScatter(const T* in, const T threshold, int* out, int size) {
+    int* temp = (int*) malloc(size*sizeof(int));
+    for (int i = 0; i < size; i++) {
+        temp[i] = static_cast<int>(in[i] > threshold);
+    }
+    cpuPrefixSum(temp, out, size);
+    free(temp);
+}
+
+template <typename T>
+int cpuCompact(const T* in, const T threshold, T* out, int size_in) {
+    int* temp = (int*) malloc(size_in*sizeof(int));
+    cpuScatter(in, threshold, temp, size_in);
+    //Note: temp holds an exclusive scan, so a passing final element is not
+    //counted here (see the README's Known Bugs).
+    int result = temp[size_in-1];
+    int current = result;
+    for (int i = (size_in-1); i >= 0; i--) {
+        if (temp[i] == current) {
+            out[current] = in[i];
+            current--;
+        }
+    }
+    free(temp);
+    return result+1;
+}
+
+void printResults(float* data, int size) {
+    fprintf(stdout, "[");
+    for (int i = 0; i < size; i++) {
+        if (i != (size-1)) {
+            fprintf(stdout, "%f, ", data[i]);
+        } else {
+            fprintf(stdout, "%f", data[i]);
+        }
+    }
+    fprintf(stdout, "]\n");
+}
+
+void printResults(int* data, int size) {
+    fprintf(stdout, "[");
+    for (int i = 0; i < size; i++) {
+        if (i != (size-1)) {
+            fprintf(stdout, "%d, ", data[i]);
+        } else {
+            fprintf(stdout, "%d", data[i]);
+        }
+    }
+    fprintf(stdout, "]\n");
+}
+
+void printResults(bool* data, int size) {
+    fprintf(stdout, "[");
+    for (int i = 0; i < size; i++) {
+        if (i != (size-1)) {
+            fprintf(stdout, "%d, ", data[i] ? 1 : 0);
+        } else {
+            fprintf(stdout, "%d", data[i] ? 1 : 0);
+        }
+    }
+    fprintf(stdout, "]\n");
+}
+
+int main(int argc, char** argv) {
+
+    int new_length;
+
+    for (LENGTH = 10; LENGTH <= MAX_LENGTH; LENGTH *= 10) {
+
+        // Seed the random number generator
+        srand(time(NULL));
+
+        // Initialize data (randomly)
+        int *start = (int*) malloc(LENGTH*sizeof(int));
+        int *end = (int*) malloc(LENGTH*sizeof(int));
+        for (int i = 0; i < LENGTH; i++) {
+            start[i] = (int) ( (float) rand() / (float) (RAND_MAX/MAX) );
+        }
+        int *scat_res = (int*) malloc(LENGTH*sizeof(int));
+        if (PRINT_DATA) {
+            fprintf(stdout, "Input Array:\n");
+            printResults(start, LENGTH);
+            fprintf(stdout, "-------- \n");
+        }
+
+        //Call the CPU implementation
+        fprintf(stdout, "CPU Scan:\n");
+        if (!TIMING) {
+            cpuPrefixSum(start, end, LENGTH);
+        } else {
+            startTiming();
+            cpuPrefixSum(start, end, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(end, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call the naive GPU implementation
+        fprintf(stdout, "Naive GPU Scan:\n");
+        if (!TIMING) {
+            gpuNaivePrefixSumI(start, end, LENGTH);
+        } else {
+            startTiming();
+            gpuNaivePrefixSumI(start, end, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(end, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call the one block GPU implementation
+        fprintf(stdout, "One Block GPU Scan:\n");
+        if (!TIMING) {
+            gpuOneBlockPrefixSumI(start, end, LENGTH);
+        } else {
+            startTiming();
+            gpuOneBlockPrefixSumI(start, end, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(end, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call the multi block GPU implementation
+        fprintf(stdout, "Multi Block GPU Scan:\n");
+        if (!TIMING) {
+            gpuNBlockPrefixSumI(start, end, LENGTH);
+        } else {
+            startTiming();
+            gpuNBlockPrefixSumI(start, end, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(end, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call CPU Scatter
+        fprintf(stdout, "CPU Scatter:\n");
+        if (!TIMING) {
+            cpuScatter(start, 5, scat_res, LENGTH);
+        } else {
+            startTiming();
+            cpuScatter(start, 5, scat_res, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(scat_res, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call GPU Scatter
+        fprintf(stdout, "GPU Scatter:\n");
+        if (!TIMING) {
+            gpuScatterI(start, 5, scat_res, LENGTH);
+        } else {
+            startTiming();
+            gpuScatterI(start, 5, scat_res, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(scat_res, LENGTH);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call CPU Compact
+        fprintf(stdout, "CPU Compact:\n");
+        if (!TIMING) {
+            new_length = cpuCompact(start, 5, end, LENGTH);
+        } else {
+            startTiming();
+            new_length = cpuCompact(start, 5, end, LENGTH);
+            float timeTaken = stopTiming();
+            fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken);
+        }
+        if (PRINT_DATA) {
+            printResults(end, new_length);
+        }
+        fprintf(stdout, "-------- \n");
+
+        //Call GPU Compact
Compact:\n"); + if (!TIMING) { + new_length = gpuCompactI(start, 5, end, LENGTH); + } else { + startTiming(); + new_length = gpuCompactI(start, 5, end, LENGTH); + float timeTaken = stopTiming(); + fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken); + } + if (PRINT_DATA) { + printResults(end, new_length); + } + fprintf(stdout, "-------- \n"); + + //Call Thrust Compact + /* + fprintf(stdout, "Thrust Compact:\n"); + if (!TIMING) { + new_length = thrustCompactI(start, 5, end, LENGTH); + } else { + startTiming(); + new_length = thrustCompactI(start, 5, end, LENGTH); + float timeTaken = stopTiming(); + fprintf(stdout, "Length: %d, Execution time: %f ms.\n", LENGTH, timeTaken); + } + if (PRINT_DATA) { + printResults(end, new_length); + } + fprintf(stdout, "-------- \n"); + */ + + //Free up memory + free(start); + free(end); + free(scat_res); + } + + while (true) { + + } + + return 0; +} diff --git a/StreamCompaction/StreamCompaction/prefix_sum.cu b/StreamCompaction/StreamCompaction/prefix_sum.cu new file mode 100644 index 0000000..97c7967 --- /dev/null +++ b/StreamCompaction/StreamCompaction/prefix_sum.cu @@ -0,0 +1,439 @@ + +// C/CUDA Dependencies +#include +#include + +// Project Dependencies +#include "prefix_sum.h" + +template +__global__ void naive_prefix_sum(T* in, T* out, int* size) { + + int index = threadIdx.x; //Keep it simple, only use x since arrays are 1 dimensional + + //Start by shifting right since the calculation is going to be inclusive and we want exclusive + if (index > 0) { + out[index] = in[index-1]; + } else { + out[index] = 0.0f; + } + __syncthreads(); + + // Switch the output back with the input + T* temp1 = in; + in = out; + out = temp1; + + //Calculate the max depth + int max_depth = ceil(log((float) *size)/log(2.0f)); + + //Loop over each depth + for (int d = 1; d <= max_depth; d++) { + + //Calculate the offset for the current depth + int off = pow(2.0f, d-1); + + // calculate the sum + if (index >= off) { + out[index] = in[index - off] + in[index]; + } else { + //Have to leave other elements alone + out[index] = in[index]; + } + + //Sync threads before the next depth to use proper values + __syncthreads(); + + //Swap the input and the output pointers for the next iteration + T* temp2 = in; + in = out; + out = temp2; + } + + //Make sure the output is the out pointer at the end + out = in; + +} + +template +__global__ void one_block_prefix_sum(T* in, T* out, int* size) { + + int index = threadIdx.x; //Keep it simple, only use x since arrays are 1 dimensional + + // Create shared memory + extern __shared__ T s[]; + T* in_s = &s[0]; + T* out_s = &s[*size]; + + //Start by shifting right since the calculation is going to be inclusive and we want exclusive + //Load into shared memory + if (index > 0) { + in_s[index] = in[index-1]; + } else { + in_s[index] = 0.0f; + } + __syncthreads(); + + //Calculate the max depth + int max_depth = ceil(log((float) *size)/log(2.0f)); + + //Loop over each depth + for (int d = 1; d <= max_depth; d++) { + + //Calculate the offset for the current depth + int off = pow(2.0f, d-1); + + // compute left-> or right->left + if ((d%2) == 1) { + + // calculate the sum + if (index >= off) { + out_s[index] = in_s[index - off] + in_s[index]; + } else { + //Have to leave other elements alone + out_s[index] = in_s[index]; + } + + } else { + + // calculate the sum + if (index >= off) { + in_s[index] = out_s[index - off] + out_s[index]; + } else { + //Have to leave other elements alone + in_s[index] = out_s[index]; + } + + } + + //Sync 
+        //Sync threads before the next depth to use proper values
+        __syncthreads();
+
+    }
+
+    //Copy the correct result to global memory
+    if ((max_depth%2) == 1) {
+        out[index] = out_s[index];
+    } else {
+        out[index] = in_s[index];
+    }
+
+}
+
+template <typename T>
+__global__ void n_block_prefix_sum(T* in, T* out, int* size) {
+
+    int index = blockIdx.x * blockDim.x + threadIdx.x; //can't keep it simple anymore
+    int s_index = threadIdx.x;
+
+    // Create shared memory
+    extern __shared__ T s[];
+    T* in_s = &s[0];
+    T* out_s = &s[blockDim.x];
+    __shared__ T lower_tile_value;
+
+    //Load into shared memory
+    //Start by shifting right since the calculation is going to be inclusive and we want exclusive
+    if (index > 0) {
+        in_s[s_index] = in[index-1];
+    } else {
+        in_s[s_index] = 0.0f;
+    }
+    __syncthreads();
+
+    //Calculate the max depth
+    int max_depth = ceil(log((float) blockDim.x)/log(2.0f));
+
+    //Loop over each depth
+    for (int d = 1; d <= max_depth; d++) {
+
+        //Calculate the offset for the current depth
+        int off = pow(2.0f, d-1);
+
+        // compute left->right or right->left
+        if ((d%2) == 1) {
+
+            // calculate the sum
+            if (s_index >= off) {
+                out_s[s_index] = in_s[s_index - off] + in_s[s_index];
+            } else {
+                //Have to leave other elements alone
+                out_s[s_index] = in_s[s_index];
+            }
+
+        } else {
+
+            // calculate the sum
+            if (s_index >= off) {
+                in_s[s_index] = out_s[s_index - off] + out_s[s_index];
+            } else {
+                //Have to leave other elements alone
+                in_s[s_index] = out_s[s_index];
+            }
+
+        }
+
+        //Sync threads before the next depth to use proper values
+        __syncthreads();
+
+    }
+
+    //Copy the correct result to global memory
+    if ((max_depth%2) == 1) {
+        out[index] = out_s[s_index];
+    } else {
+        out[index] = in_s[s_index];
+    }
+
+    __syncthreads();
+
+    //Determine the number of additional loops that will be required
+    int kernel_calls = ceil(log((float) gridDim.x)/log(2.0f))+1;
+
+    //Loop over the kernel calls, doing a pseudo-serial scan over the remaining layers
+    //Note: __syncthreads() only synchronizes within a block, so this
+    //cross-block pass relies on all blocks running in lockstep, which is not
+    //guaranteed (a likely contributor to the N Block bugs in the README).
+    for (int k = 0; k < kernel_calls; k++) {
+
+        //Swap out and in
+        T* temp = in;
+        in = out;
+        out = temp;
+
+        //Load the needed value for this tile into shared memory
+        if (s_index == 0) {
+            if (blockIdx.x >= (int) pow(2.0f, k)) {
+                lower_tile_value = in[(blockIdx.x + 1 - (int) pow(2.0f, k))*blockDim.x - 1];
+            } else {
+                lower_tile_value = 0.0f;
+            }
+        }
+        __syncthreads();
+
+        //Add to the output
+        out[index] = in[index] + lower_tile_value;
+        __syncthreads();
+    }
+
+}
+
+template <typename T>
+__global__ void threshold_array(const T* in, const T* threshold, int* out) {
+    int index = threadIdx.x; //Keep it simple, only use x since arrays are 1 dimensional
+    out[index] = in[index] > *threshold ? 1 : 0;
+}
+
+template <typename T>
+__global__ void compact_array(const T* in, const int* indices, const int* mask, T* out, int* size) {
+    int index = threadIdx.x; //Keep it simple, only use x since arrays are 1 dimensional
+    if (mask[index] == 1) {
+        out[indices[index]] = in[index];
+    }
+    //Note: indices[] holds an exclusive scan, so this count ignores whether
+    //the final element passed the threshold (see the README's Known Bugs);
+    //every thread also races to write *size here.
+    *size = (indices[*size-1] + 1);
+}
+
+template <typename T>
+void gpuNaivePrefixSum(const T* in, T* out, int size) {
+    //Allocate data on GPU
+    T* in_d;
+    T* out_d;
+    int* size_d;
+    cudaMalloc((void**)&in_d, size*sizeof(T));
+    cudaMalloc((void**)&out_d, size*sizeof(T));
+    cudaMalloc((void**)&size_d, sizeof(int));
+
+    //Copy data to GPU
+    cudaMemcpy(in_d, in, size*sizeof(T), cudaMemcpyHostToDevice);
+    cudaMemcpy(size_d, &size, sizeof(int), cudaMemcpyHostToDevice);
+
+    //Call the kernel
+    naive_prefix_sum<<<1,size>>>(in_d, out_d, size_d);
+    cudaDeviceSynchronize();
+
+    //Copy data from GPU
+    cudaMemcpy(out, out_d, size*sizeof(T), cudaMemcpyDeviceToHost);
+
+    //Clear memory from GPU
+    cudaFree(in_d);
+    cudaFree(out_d);
+    cudaFree(size_d);
+}
+
+template <typename T>
+void gpuOneBlockPrefixSum(const T* in, T* out, int size) {
+    //Allocate data on GPU
+    T* in_d;
+    T* out_d;
+    int* size_d;
+    cudaMalloc((void**)&in_d, size*sizeof(T));
+    cudaMalloc((void**)&out_d, size*sizeof(T));
+    cudaMalloc((void**)&size_d, sizeof(int));
+
+    //Copy data to GPU
+    cudaMemcpy(in_d, in, size*sizeof(T), cudaMemcpyHostToDevice);
+    cudaMemcpy(size_d, &size, sizeof(int), cudaMemcpyHostToDevice);
+
+    //Call the kernel
+    one_block_prefix_sum<<<1,size, 2*size*sizeof(T)>>>(in_d, out_d, size_d);
+    cudaDeviceSynchronize();
+
+    //Copy data from GPU
+    cudaMemcpy(out, out_d, size*sizeof(T), cudaMemcpyDeviceToHost);
+
+    //Clear memory from GPU
+    cudaFree(in_d);
+    cudaFree(out_d);
+    cudaFree(size_d);
+}
+
+template <typename T>
+void gpuNBlockPrefixSum(const T* in, T* out, int size) {
+    //Allocate data on GPU
+    T* in_d;
+    T* out_d;
+    int* size_d;
+    cudaMalloc((void**)&in_d, size*sizeof(T));
+    cudaMalloc((void**)&out_d, size*sizeof(T));
+    cudaMalloc((void**)&size_d, sizeof(int));
+
+    //Copy data to GPU
+    cudaMemcpy(in_d, in, size*sizeof(T), cudaMemcpyHostToDevice);
+    cudaMemcpy(size_d, &size, sizeof(int), cudaMemcpyHostToDevice);
+
+    //Determine the needed number of blocks/threads
+    int needed_bytes = 2*size*sizeof(T);
+    int threads_per_block = 16384/needed_bytes;
+    int n_blocks = (size + threads_per_block - 1)/threads_per_block;
+
+    n_block_prefix_sum<<<n_blocks, threads_per_block, 2*threads_per_block*sizeof(T)>>>(in_d, out_d, size_d);
+    cudaDeviceSynchronize();
+
+    //Copy data from GPU
+    cudaMemcpy(out, out_d, size*sizeof(T), cudaMemcpyDeviceToHost);
+
+    //Clear memory from GPU
+    cudaFree(in_d);
+    cudaFree(out_d);
+    cudaFree(size_d);
+}
+
+template <typename T>
+void gpuScatter(const T* in, const T threshold, int* out, int size) {
+    //Allocate data on GPU
+    T* in_d;
+    int* out_d;
+    int* mask_d;
+    T* threshold_d;
+    int* size_d;
+    cudaMalloc((void**)&in_d, size*sizeof(T));
+    cudaMalloc((void**)&out_d, size*sizeof(int));
+    cudaMalloc((void**)&mask_d, size*sizeof(int));
+    cudaMalloc((void**)&threshold_d, sizeof(T));
+    cudaMalloc((void**)&size_d, sizeof(int));
+
+    //Copy data to GPU
+    cudaMemcpy(in_d, in, size*sizeof(T), cudaMemcpyHostToDevice);
+    cudaMemcpy(size_d, &size, sizeof(int), cudaMemcpyHostToDevice);
+    cudaMemcpy(threshold_d, &threshold, sizeof(T), cudaMemcpyHostToDevice);
+
+    // Call the threshold kernel
+    threshold_array<<<1,size>>>(in_d, threshold_d, mask_d);
+    cudaDeviceSynchronize();
+
+    // Call the prefix sum kernel (the mask is an int array, so size the
+    // shared memory with sizeof(int))
+    one_block_prefix_sum<<<1,size, 2*size*sizeof(int)>>>(mask_d, out_d, size_d);
+    cudaDeviceSynchronize();
+
+    // Copy the result back from the GPU
+    cudaMemcpy(out, out_d, size*sizeof(int), cudaMemcpyDeviceToHost);
+
+    //Clear GPU Memory
+    cudaFree(in_d);
+    cudaFree(out_d);
+    cudaFree(mask_d);
+    cudaFree(threshold_d);
+    cudaFree(size_d);
+
+}
+
+template <typename T>
+int gpuCompact(const T* in, const T threshold, T* out, int size) {
+    //Allocate data on GPU
+    T* in_d;
+    T* out_d;
+    int* mask_d;
+    int* indices_d;
+    T* threshold_d;
+    int* size_d;
+    cudaMalloc((void**)&in_d, size*sizeof(T));
+    cudaMalloc((void**)&out_d, size*sizeof(T));
+    cudaMalloc((void**)&mask_d, size*sizeof(int));
+    cudaMalloc((void**)&indices_d, size*sizeof(int));
+    cudaMalloc((void**)&threshold_d, sizeof(T));
+    cudaMalloc((void**)&size_d, sizeof(int));
+
+    //Copy data to GPU
+    cudaMemcpy(in_d, in, size*sizeof(T), cudaMemcpyHostToDevice);
+    cudaMemcpy(size_d, &size, sizeof(int), cudaMemcpyHostToDevice);
+    cudaMemcpy(threshold_d, &threshold, sizeof(T), cudaMemcpyHostToDevice);
+
+    // Call the threshold kernel
+    threshold_array<<<1,size>>>(in_d, threshold_d, mask_d);
+    cudaDeviceSynchronize();
+
+    // Call the prefix sum kernel on the int mask
+    one_block_prefix_sum<<<1,size, 2*size*sizeof(int)>>>(mask_d, indices_d, size_d);
+    cudaDeviceSynchronize();
+
+    // Call the compaction kernel (this also overwrites *size_d with the new length)
+    compact_array<<<1,size>>>(in_d, indices_d, mask_d, out_d, size_d);
+    cudaDeviceSynchronize();
+
+    // Copy the result back from the GPU
+    int result;
+    cudaMemcpy(&result, size_d, sizeof(int), cudaMemcpyDeviceToHost);
+    cudaMemcpy(out, out_d, result*sizeof(T), cudaMemcpyDeviceToHost);
+
+    //Clear GPU Memory
+    cudaFree(in_d);
+    cudaFree(out_d);
+    cudaFree(mask_d);
+    cudaFree(indices_d);
+    cudaFree(threshold_d);
+    cudaFree(size_d);
+
+    return result;
+}
+
+void gpuNaivePrefixSumF(const float* in, float* out, int size) {
+    gpuNaivePrefixSum(in, out, size);
+}
+
+void gpuNaivePrefixSumI(const int* in, int* out, int size) {
+    gpuNaivePrefixSum(in, out, size);
+}
+
+/*
+void gpuOneBlockPrefixSumF(const float* in, float* out, int size) {
+    gpuOneBlockPrefixSum(in, out, size);
+}
+*/
+
+void gpuOneBlockPrefixSumI(const int* in, int* out, int size) {
+    gpuOneBlockPrefixSum(in, out, size);
+}
+
+void gpuNBlockPrefixSumI(const int* in, int* out, int size) {
+    gpuNBlockPrefixSum(in, out, size);
+}
+
+void gpuScatterI(const int* in, const int threshold, int* out, int size) {
+    gpuScatter(in, threshold, out, size);
+}
+
+void gpuScatterF(const float* in, const float threshold, int* out, int size) {
+    gpuScatter(in, threshold, out, size);
+}
+
+int gpuCompactI(const int* in, const int threshold, int* out, int size) {
+    return gpuCompact(in, threshold, out, size);
+}
+
+int gpuCompactF(const float* in, const float threshold, float* out, int size) {
+    return gpuCompact(in, threshold, out, size);
+}
+
diff --git a/StreamCompaction/StreamCompaction/prefix_sum.h b/StreamCompaction/StreamCompaction/prefix_sum.h
new file mode 100644
index 0000000..20804ab
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/prefix_sum.h
@@ -0,0 +1,53 @@
+#ifndef PREFIX_SUM_H_
+#define PREFIX_SUM_H_
+
+/* Naive Implementation of GPU Prefix Sum */
+/* @param in Input array to sum over */
+/* @param out Output array */
+/* @param size The integer size of in */
+template <typename T>
+void gpuNaivePrefixSum(const T* in, T* out, int size);
+
+/* Single Block Shared Implementation of GPU Prefix Sum */
+/* @param in Input array to sum over */
+/* @param out Output array */
+/* @param size The integer size of in */
+template <typename T>
+void gpuOneBlockPrefixSum(const T* in, T* out, int size);
+
+/* Multi Block Shared Implementation of GPU Prefix Sum */
+/* @param in Input array to sum over */
+/* @param out Output array */
+/* @param size The integer size of in */
+template <typename T>
+void gpuNBlockPrefixSum(const T* in, T* out, int size);
+
+/* Compute scatter indices (threshold mask + exclusive scan) using GPU prefix sum */
+/* @param in Input array to scatter */
+/* @param threshold Value to threshold the input array on (using >) */
+/* @param out Output array of scattered indices */
+/* @param size The integer size of in */
+template <typename T>
+void gpuScatter(const T* in, const T threshold, int* out, int size);
+
+/* Perform compaction using GPU prefix sum */
+/* @param in Input array to compact */
+/* @param threshold Value to threshold the input array on (using >) */
+/* @param out Output array of condensed data meeting the threshold */
+/* @param size The integer size of in */
+/* @return The integer size of out */
+template <typename T>
+int gpuCompact(const T* in, const T threshold, T* out, int size);
+
+// Instances of the templates (need to be compiled)
+void gpuNaivePrefixSumF(const float* in, float* out, int size);
+void gpuNaivePrefixSumI(const int* in, int* out, int size);
+//void gpuOneBlockPrefixSumF(const float* in, float* out, int size); //WHY DOESN'T CUDA LET ME COMPILE BOTH HERE?!
+void gpuOneBlockPrefixSumI(const int* in, int* out, int size);
+void gpuNBlockPrefixSumI(const int* in, int* out, int size);
+void gpuScatterI(const int* in, const int threshold, int* out, int size);
+void gpuScatterF(const float* in, const float threshold, int* out, int size);
+int gpuCompactI(const int* in, const int threshold, int* out, int size);
+int gpuCompactF(const float* in, const float threshold, float* out, int size);
+
+#endif //PREFIX_SUM_H_
diff --git a/StreamCompaction/StreamCompaction/thrust_compact.cu b/StreamCompaction/StreamCompaction/thrust_compact.cu
new file mode 100644
index 0000000..ce61de8
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/thrust_compact.cu
@@ -0,0 +1,36 @@
+
+#include <thrust/copy.h>
+#include "thrust_compact.h"
+
+// Predicate for Thrust
+template <typename T>
+struct isGreaterThan5 {
+    __host__ __device__ bool operator() (const T& in) {
+        return (in > 5);
+    }
+};
+
+
+int thrustCompactI(const int* in, const int threshold, int* out, int size) {
+    /*
+    // Convert the input to thrust formats
+    thrust::host_ptr h_ptr_in(in);
+    thrust::fill(h_ptr_in, h_ptr_in+N, )
+    thrust::host_ptr h_ptr_out(out);
+
+    // Copy data to the device
+    thrust::device_vector d_vec_in = h_vec_in;
+    thrust::device_vector d_vec_out = h_vec_out;
+
+    // Execute stream compaction
+    thrust::device_vector::iterator d_it = thrust::copy_if(d_vec_in.begin(), d_vec_in.end(), d_vec_out.begin(), isGreaterThan5());
+
+    // Copy the result vector back
+    h_vec_out = d_vec_out;
+    //thrust::host_vector::iterator h_it = d_it;
+
+    // Convert the output back from thrust formats
+    out = thrust::raw_pointer_cast(&h_vec_out[0]);
+    */
+    return 0;//(h_it - h_vec_out.begin());
+}
diff --git a/StreamCompaction/StreamCompaction/thrust_compact.h b/StreamCompaction/StreamCompaction/thrust_compact.h
new file mode 100644
index 0000000..9d6c360
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/thrust_compact.h
@@ -0,0 +1,15 @@
+#ifndef THRUST_COMPACT_H_
+#define THRUST_COMPACT_H_
+
+// Thrust Dependencies
+#include <thrust/device_vector.h>
+
+/* Perform compaction using GPU with Thrust libraries */
+/* @param in Input array to compact */
+/* @param threshold Value to threshold the input array on (using >) */
+/* @param out Output array of condensed data meeting the threshold */
+/* @param size The integer size of in */
+/* @return The integer size of out */
+int thrustCompactI(const int* in, const int threshold, int* out, int size);
+
+#endif // THRUST_COMPACT_H_
diff --git a/StreamCompaction/StreamCompaction/timing_utils.cu b/StreamCompaction/StreamCompaction/timing_utils.cu
new file mode 100644
index 0000000..bb9809f
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/timing_utils.cu
@@ -0,0 +1,32 @@
+
+// CUDA Dependencies
+#include <cuda_runtime.h>
+
+// Project Dependencies
+#include "timing_utils.h"
+
+//Global data
+cudaEvent_t beginEvent, endEvent;
+
+void startTiming() {
+    //Create the timing events
+    cudaEventCreate( &beginEvent );
+    cudaEventCreate( &endEvent );
+
+    //Record the start event (elapsed times are reported in milliseconds)
+    cudaEventRecord(beginEvent, 0);
+}
+
+float stopTiming() {
+    float time;
+
+    cudaEventRecord(endEvent, 0);
+    cudaEventSynchronize(endEvent);
+    cudaEventElapsedTime(&time, beginEvent, endEvent);
+
+    //Cleanup timers
+    cudaEventDestroy(beginEvent);
+    cudaEventDestroy(endEvent);
+
+    return time;
+}
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/timing_utils.h b/StreamCompaction/StreamCompaction/timing_utils.h
new file mode 100644
index 0000000..ffdd450
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/timing_utils.h
@@ -0,0 +1,11 @@
+#ifndef TIMING_UTILS_H_
+#define TIMING_UTILS_H_
+
+/* Function to mark the timer to start */
+void startTiming();
+
+/* Function to mark the timer to end and return elapsed time */
+/* @return Time (in ms) since startTiming was called */
+float stopTiming();
+
+#endif // TIMING_UTILS_H_