diff --git a/README.md b/README.md
index 98dd9a8..d079ce0 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,98 @@
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking
+===========================================
+* Ziyu Li
+* Tested on: Windows 7, i7-3840QM @ 2.8GHz 16GB, Nvidia Quadro K4000M 4096MB (Personal Laptop)

-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Performance Analysis
+#### Kernel Performance Measurement

-### (TODO: Your README)
+Each test starts the program in release mode with visualization disabled, lets it run for 30 seconds, and writes the measured frame rate once per second to a CSV file, which makes the statistics easy to compute. Seven boid counts, from 10 to 40960, are tested for each of the naive, scattered uniform grid, and coherent uniform grid methods. To keep the results accurate and avoid GPU and CPU throttling, there is a 60-second cooldown between tests.

-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+#### Results
+
++ ![](images/1.gif)
+
+**Performance measurement at block size 128** (the results below are framerates averaged over a 30-second run)
+
+In my measurements, the naive approach is faster than the two spatially accelerated implementations when the number of boids is low (approx. fewer than 160). From 640 boids up to 40960, however, the advantage of the spatially accelerated implementations becomes much more obvious.
+
+The results also show how much a careful GPU implementation matters: on my benchmark, a grid-based implementation runs up to roughly 50 times faster than the naive approach.
+
+The graph below compares the three implementations across different numbers of boids.
+
++ ![](images/comparision.PNG)
+
+ Num of boids | Naive (fps) | Scattered (fps) | Coherent (fps) | Max performance boost
+ ---|---|---|---|---
+ 10 | 2613 | 1188 | 1293 | -50.52%
+ 40 | 2295 | 1284 | 1269 | -44.05%
+ 160 | 1659 | 1179 | 1263 | -23.87%
+ 640 | 863 | 1223 | 1200 | +41.71%
+ 2560 | 280 | 892 | 940 | +235.7%
+ 10240 | 38.2 | 431 | 439 | +1049%
+ 40960 | 2.32 | 116 | 118 | +4986%
+
+
+**Performance measurement at different block sizes** (the results below are framerates averaged over a 30-second run)
+
+This measurement compares how the block size affects performance, with three benchmarks at different boid counts. The graphs and results are shown below:
+
+**1. boid count: 160**
+
++ ![](images/160Boids.PNG)
+
+ Block size | Naive (fps) | Scattered (fps) | Coherent (fps)
+ ---|---|---|---
+ 64 | 1767 | 1285 | 1273
+ 128 | 1744 | 1271 | 1279
+ 256 | 1631 | 1284 | 1259
+ 512 | 1763 | 1283 | 1252
+
+**2. boid count: 1.6k**
+
++ ![](images/1.6kBoids.PNG)
+
+ Block size | Naive (fps) | Scattered (fps) | Coherent (fps)
+ ---|---|---|---
+ 64 | 432 | 958 | 1018
+ 128 | 430 | 967 | 1001
+ 256 | 429 | 970 | 1023
+ 512 | 429 | 851 | 905
+
+**3. boid count: 16k**
+
++ ![](images/16kBoids.PNG)
+
+ Block size | Naive (fps) | Scattered (fps) | Coherent (fps)
+ ---|---|---|---
+ 64 | 12.60 | 253 | 236
+ 128 | 12.59 | 256 | 275
+ 256 | 12.57 | 243 | 277
+ 512 | 12.48 | 237 | 248
+
+#### Questions
+
+**For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+
+For all three implementations, more boids means more computation and more memory traffic, so a larger boid count spends more time on computation and data transfer, which lowers the framerate and drops frames.
+
+**For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+Across the measurements, the trend is that a larger block size makes the simulation slightly slower, and a smaller block size (and hence a larger block count) makes it slightly faster. My guess is that, in theory, a large block size suits a multiprocessor that contains more streaming processors: a block with too many threads keeps the scheduler busier arranging the work, although larger blocks are also easier to spread across different multiprocessors. So I believe there is a sweet spot for a given program under given conditions that achieves the best performance.
+
+**For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+My expectation was a large boost from the coherent uniform grid, and I ran three benchmarks at different boid counts to check. Based on those results, the coherent uniform grid is faster most of the time, but not by a huge margin.
+
+In some cases there is no improvement at all: in the 160-boid benchmark, for instance, I could not see any gain. The reason is that with so little work on the GPU, the extra pass that rearranges the position and velocity buffers adds a delay that takes up a large share of the total computation time.
+
+For a large number of boids, things are different: I think the memory coherence and the removed level of indirection in this implementation are what produce the performance boost.
+
+**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not?**
+
+I believe that as long as the cell width is in a reasonable range, performance should be almost the same.
+However, if the cell width is too small, constructing and updating the grid takes up a large share of the kernel time.
+If the cell width is too big, there is little reason to use the grid to accelerate the search at all.
+
+For checking 27 vs 8 neighboring cells, I would guess that checking 8 wider cells is slightly faster, because it needs fewer cell lookups per boid, even though each wider cell holds more candidate boids to test.
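+
+As a concrete sketch of that trade-off (hypothetical code, not the kernels in this project; `visitCell` stands in for the per-cell boid loop, and the grid conventions follow `kernel.cu`):
+
+```cuda
+// Cell width = 2 * max rule distance: the whole search sphere fits in the
+// 2x2x2 block of cells that share the corner nearest to the boid.
+glm::vec3 rel = (pos - gridMin) * inverseCellWidth;
+int x0 = (int)(rel.x - 0.5f);  // lower corner of the 2x2x2 block
+int y0 = (int)(rel.y - 0.5f);
+int z0 = (int)(rel.z - 0.5f);
+for (int z = z0; z <= z0 + 1; z++)
+  for (int y = y0; y <= y0 + 1; y++)
+    for (int x = x0; x <= x0 + 1; x++)
+      visitCell(x, y, z);             // 8 cells, bounds-checked as in the real kernels
+
+// Cell width = max rule distance: every adjacent cell can hold a neighbor,
+// but each cell covers 8x less volume, so fewer boids are tested per cell.
+int cx = (int)rel.x, cy = (int)rel.y, cz = (int)rel.z;
+for (int z = cz - 1; z <= cz + 1; z++)
+  for (int y = cy - 1; y <= cy + 1; y++)
+    for (int x = cx - 1; x <= cx + 1; x++)
+      visitCell(x, y, z);             // 27 cells
+```
+
+Either way, iterating with `z` outermost and `x` innermost matches the layout of `gridIndex3Dto1D`, which is also part of why the coherent grid's reordered data helps.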
diff --git a/images/1.6kBoids.PNG b/images/1.6kBoids.PNG
new file mode 100644
index 0000000..a78351e
Binary files /dev/null and b/images/1.6kBoids.PNG differ
diff --git a/images/1.gif b/images/1.gif
new file mode 100644
index 0000000..4005a2c
Binary files /dev/null and b/images/1.gif differ
diff --git a/images/160Boids.PNG b/images/160Boids.PNG
new file mode 100644
index 0000000..de8384d
Binary files /dev/null and b/images/160Boids.PNG differ
diff --git a/images/16kBoids.PNG b/images/16kBoids.PNG
new file mode 100644
index 0000000..4bcab0d
Binary files /dev/null and b/images/16kBoids.PNG differ
diff --git a/images/comparision.PNG b/images/comparision.PNG
new file mode 100644
index 0000000..652f41a
Binary files /dev/null and b/images/comparision.PNG differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..750f0cb 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_30
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index aaf0fbf..74823b9 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -21,14 +21,14 @@
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAError(const char *msg, int line = -1) {
-  cudaError_t err = cudaGetLastError();
-  if (cudaSuccess != err) {
-    if (line >= 0) {
-      fprintf(stderr, "Line %d: ", line);
-    }
-    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
-    exit(EXIT_FAILURE);
-  }
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess != err) {
+        if (line >= 0) {
+            fprintf(stderr, "Line %d: ", line);
+        }
+        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
+        exit(EXIT_FAILURE);
+    }
}


@@ -37,7 +37,7 @@ void checkCUDAError(const char *msg, int line = -1) {
*****************/

/*! Block size used for CUDA kernel launch. */
-#define blockSize 128
+#define blockSize 512

// LOOK-1.2 Parameters for the boids algorithm.
// These worked well in our reference implementation.
@@ -76,18 +76,19 @@ glm::vec3 *dev_vel2;
// For efficient sorting and the uniform grid. These should always be parallel.
int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
int *dev_particleGridIndices; // What grid cell is this particle in?
-// needed for use with thrust
+ // needed for use with thrust
thrust::device_ptr<int> dev_thrust_particleArrayIndices;
thrust::device_ptr<int> dev_thrust_particleGridIndices;

int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
int *dev_gridCellEndIndices;   // to this cell?

-// TODO-2.3 - consider what additional buffers you might need to reshuffle
-// the position and velocity data to be coherent within cells.

-// LOOK-2.1 - Grid parameters based on simulation parameters.
-// These are automatically computed for you in Boids::initSimulation
+glm::vec3* dev_pos_new;  // TODO-2.3 - consider what additional buffers you might need to reshuffle
+glm::vec3* dev_vel1_new; // the position and velocity data to be coherent within cells.
+
+ // LOOK-2.1 - Grid parameters based on simulation parameters.
+ // These are automatically computed for you in Boids::initSimulation
int gridCellCount;
int gridSideCount;
float gridCellWidth;
@@ -99,13 +100,13 @@ glm::vec3 gridMinimum;
******************/

__host__ __device__ unsigned int hash(unsigned int a) {
-  a = (a + 0x7ed55d16) + (a << 12);
-  a = (a ^ 0xc761c23c) ^ (a >> 19);
-  a = (a + 0x165667b1) + (a << 5);
-  a = (a + 0xd3a2646c) ^ (a << 9);
-  a = (a + 0xfd7046c5) + (a << 3);
-  a = (a ^ 0xb55a4f09) ^ (a >> 16);
-  return a;
+    a = (a + 0x7ed55d16) + (a << 12);
+    a = (a ^ 0xc761c23c) ^ (a >> 19);
+    a = (a + 0x165667b1) + (a << 5);
+    a = (a + 0xd3a2646c) ^ (a << 9);
+    a = (a + 0xfd7046c5) + (a << 3);
+    a = (a ^ 0xb55a4f09) ^ (a >> 16);
+    return a;
}

/**
@@ -113,10 +114,10 @@ __host__ __device__ unsigned int hash(unsigned int a) {
* Function for generating a random vec3.
*/
__host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
-  thrust::default_random_engine rng(hash((int)(index * time)));
-  thrust::uniform_real_distribution<float> unitDistrib(-1, 1);
+    thrust::default_random_engine rng(hash((int)(index * time)));
+    thrust::uniform_real_distribution<float> unitDistrib(-1, 1);

-  return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng));
+    return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng));
}

/**
@@ -124,52 +125,70 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
* CUDA kernel for generating boids with a specified mass randomly around the star.
*/
__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if (index < N) {
-    glm::vec3 rand = generateRandomVec3(time, index);
-    arr[index].x = scale * rand.x;
-    arr[index].y = scale * rand.y;
-    arr[index].z = scale * rand.z;
-  }
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index < N) {
+        glm::vec3 rand = generateRandomVec3(time, index);
+        arr[index].x = scale * rand.x;
+        arr[index].y = scale * rand.y;
+        arr[index].z = scale * rand.z;
+    }
}

/**
* Initialize memory, update some globals
*/
void Boids::initSimulation(int N) {
-  numObjects = N;
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  // LOOK-1.2 - This is basic CUDA memory management and error checking.
-  // Don't forget to cudaFree in Boids::endSimulation.
-  cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
-
-  cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
-
-  cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
-
-  // LOOK-1.2 - This is a typical CUDA kernel invocation.
-  kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
-    dev_pos, scene_scale);
-  checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
-
-  // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
-  int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
-  gridSideCount = 2 * halfSideCount;
-
-  gridCellCount = gridSideCount * gridSideCount * gridSideCount;
-  gridInverseCellWidth = 1.0f / gridCellWidth;
-  float halfGridWidth = gridCellWidth * halfSideCount;
-  gridMinimum.x -= halfGridWidth;
-  gridMinimum.y -= halfGridWidth;
-  gridMinimum.z -= halfGridWidth;
-
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
-  cudaThreadSynchronize();
+    numObjects = N;
+    dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+    // LOOK-1.2 - This is basic CUDA memory management and error checking.
+    // Don't forget to cudaFree in Boids::endSimulation.
+    cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
+    checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
+
+    cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
+    checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
+
+    cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
+    checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
+
+    // LOOK-1.2 - This is a typical CUDA kernel invocation.
+    kernGenerateRandomPosArray << <fullBlocksPerGrid, blockSize >> >(1, numObjects,
+        dev_pos, scene_scale);
+    checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
+
+    // LOOK-2.1 computing grid params
+    gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+    int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
+    gridSideCount = 2 * halfSideCount;
+
+    gridCellCount = gridSideCount * gridSideCount * gridSideCount;
+    gridInverseCellWidth = 1.0f / gridCellWidth;
+    float halfGridWidth = gridCellWidth * halfSideCount;
+    gridMinimum.x -= halfGridWidth;
+    gridMinimum.y -= halfGridWidth;
+    gridMinimum.z -= halfGridWidth;
+
+    // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+    cudaMalloc((void**)&dev_particleGridIndices, sizeof(int) * N);
+    checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+    cudaMalloc((void**)&dev_particleArrayIndices, sizeof(int) * N);
+    checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+    cudaMalloc((void**)&dev_gridCellStartIndices, sizeof(int) * gridCellCount);
+    checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+    cudaMalloc((void**)&dev_gridCellEndIndices, sizeof(int) * gridCellCount);
+    checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+    cudaMalloc((void**)&dev_pos_new, sizeof(glm::vec3) * N);
+    checkCUDAErrorWithLine("cudaMalloc dev_pos_new failed!");
+
+    cudaMalloc((void**)&dev_vel1_new, sizeof(glm::vec3) * N);
+    checkCUDAErrorWithLine("cudaMalloc dev_vel1_new failed!");
+
+    cudaThreadSynchronize();
}


@@ -181,41 +200,41 @@ void Boids::initSimulation(int N) {
* Copy the boid positions into the VBO so that they can be drawn by OpenGL.
*/
__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) {
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);

-  float c_scale = -1.0f / s_scale;
+    float c_scale = -1.0f / s_scale;

-  if (index < N) {
-    vbo[4 * index + 0] = pos[index].x * c_scale;
-    vbo[4 * index + 1] = pos[index].y * c_scale;
-    vbo[4 * index + 2] = pos[index].z * c_scale;
-    vbo[4 * index + 3] = 1.0f;
-  }
+    if (index < N) {
+        vbo[4 * index + 0] = pos[index].x * c_scale;
+        vbo[4 * index + 1] = pos[index].y * c_scale;
+        vbo[4 * index + 2] = pos[index].z * c_scale;
+        vbo[4 * index + 3] = 1.0f;
+    }
}

__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) {
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
-
-  if (index < N) {
-    vbo[4 * index + 0] = vel[index].x + 0.3f;
-    vbo[4 * index + 1] = vel[index].y + 0.3f;
-    vbo[4 * index + 2] = vel[index].z + 0.3f;
-    vbo[4 * index + 3] = 1.0f;
-  }
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+    if (index < N) {
+        vbo[4 * index + 0] = vel[index].x + 0.3f;
+        vbo[4 * index + 1] = vel[index].y + 0.3f;
+        vbo[4 * index + 2] = vel[index].z + 0.3f;
+        vbo[4 * index + 3] = 1.0f;
+    }
}

/**
* Wrapper for call to the kernCopyboidsToVBO CUDA kernel.
*/
void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
-  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);

-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+    kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
+    kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);

-  checkCUDAErrorWithLine("copyBoidsToVBO failed!");
+    checkCUDAErrorWithLine("copyBoidsToVBO failed!");

-  cudaThreadSynchronize();
+    cudaThreadSynchronize();
}


@@ -230,10 +249,60 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* in the `pos` and `vel` arrays.
*/
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    // Rule 2: boids try to stay a distance d away from each other
+    // Rule 3: boids try to match the speed of surrounding boids
+
+    glm::vec3 vel_cohesion(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_separation(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_alignment(0.0f, 0.0f, 0.0f);
+
+    int neighbors_cohesion = 0;
+    int neighbors_alignment = 0;
+
+    glm::vec3 pos_self = pos[iSelf];
+
+    // for each boid
+    for (int i = 0; i < N; i++) {
+        if (i != iSelf) {
+            glm::vec3 pos_i = pos[i];
+            glm::vec3 vel_i = vel[i];
+
+            float dist = glm::length(pos_i - pos_self);
+
+            // cohesion
+            if (dist < rule1Distance) {
+                vel_cohesion += pos_i;
+                neighbors_cohesion++;
+            }
+
+            // separation
+            if (dist < rule2Distance) {
+                vel_separation -= (pos_i - pos_self);
+            }
+
+            // alignment
+            if (dist < rule3Distance) {
+                vel_alignment += vel_i;
+                neighbors_alignment++;
+            }
+        }
+    }
+
+    if (neighbors_cohesion) {
+        vel_cohesion /= neighbors_cohesion;
+        vel_cohesion = (vel_cohesion - pos_self) * rule1Scale;
+    }
+
+    vel_separation *= rule2Scale;
+
+    if (neighbors_alignment) {
+        vel_alignment /= neighbors_alignment;
+        vel_alignment *= rule3Scale;
+    }
+
+    return vel_alignment + vel_separation + vel_cohesion;
}

/**
@@ -241,10 +310,25 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
* For each of the `N` bodies, update its position based on its current velocity.
*/
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+    glm::vec3 *vel1, glm::vec3 *vel2) {
+    // Compute a new velocity based on pos and vel1
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+
+    glm::vec3 temp_v;
+
+    temp_v = vel1[index];
+    temp_v += computeVelocityChange(N, index, pos, vel1);
+
+    // Clamp the speed
+    if (glm::length(temp_v) > maxSpeed) {
+        temp_v = temp_v / glm::length(temp_v) * maxSpeed;
+    }
+
+    // Record the new velocity into vel2. Question: why NOT vel1?
+    vel2[index] = temp_v;
}

/**
@@ -252,24 +336,24 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
* For each of the `N` bodies, update its position based on its current velocity.
*/
__global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
-  // Update position by velocity
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
-  if (index >= N) {
-    return;
-  }
-  glm::vec3 thisPos = pos[index];
-  thisPos += vel[index] * dt;
-
-  // Wrap the boids around so we don't lose them
-  thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
-  thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
-  thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
-
-  thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
-  thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
-  thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
-
-  pos[index] = thisPos;
+    // Update position by velocity
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+    glm::vec3 thisPos = pos[index];
+    thisPos += vel[index] * dt;
+
+    // Wrap the boids around so we don't lose them
+    thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
+    thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
+    thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
+
+    thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
+    thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
+    thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
+
+    pos[index] = thisPos;
}

// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
// LOOK-2.3 Looking at this method, what would be the most memory efficient
//          order for iterating over neighboring grid cells?
//          for(x)
@@ -279,181 +363,499 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
//          for(y)
//          for(z)? Or some other order?
__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
-  return x + y * gridResolution + z * gridResolution * gridResolution;
+    return x + y * gridResolution + z * gridResolution * gridResolution;
}

__global__ void kernComputeIndices(int N, int gridResolution,
-  glm::vec3 gridMin, float inverseCellWidth,
-  glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+    glm::vec3 gridMin, float inverseCellWidth,
+    glm::vec3 *pos, int *indices, int *gridIndices) {
+    // TODO-2.1
+    // - Label each boid with the index of its grid cell.
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) {
+        return;
+    }
+
+    glm::vec3 p = pos[index];
+
+    int x = (int)((p.x - gridMin.x) * inverseCellWidth);
+    int y = (int)((p.y - gridMin.y) * inverseCellWidth);
+    int z = (int)((p.z - gridMin.z) * inverseCellWidth);
+
+    gridIndices[index] = gridIndex3Dto1D(x, y, z, gridResolution);
+
+    // - Set up a parallel array of integer indices as pointers to the actual
+    //   boid data in pos and vel1/vel2
+
+    indices[index] = index;
}

// LOOK-2.1 Consider how this could be useful for indicating that a cell
//          does not enclose any boids
__global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if (index < N) {
-    intBuffer[index] = value;
-  }
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index < N) {
+        intBuffer[index] = value;
+    }
}

__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
-  int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+    int *gridCellStartIndices, int *gridCellEndIndices) {
+    // TODO-2.1
+    // Identify the start point of each cell in the gridIndices array.
+    // This is basically a parallel unrolling of a loop that goes
+    // "this index doesn't match the one before it, must be a new cell!"
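+    // Example: with sorted particleGridIndices = [4, 4, 6, 6, 6, 9],
+    // thread 1 sees 4 -> 6 and writes gridCellEndIndices[4] = 1 and
+    // gridCellStartIndices[6] = 2; thread 4 sees 6 -> 9 and writes
+    // gridCellEndIndices[6] = 4 and gridCellStartIndices[9] = 5; thread 0
+    // pins gridCellStartIndices[4] = 0 and thread 5 pins
+    // gridCellEndIndices[9] = 5. Cells never touched keep the -1 written
+    // by kernResetIntBuffer, marking them as empty.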
+
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) {
+        return;
+    }
+
+    int g_idx = particleGridIndices[index];
+
+    if (index == 0) {
+        gridCellStartIndices[g_idx] = 0;
+    }
+    if (index == N - 1) {
+        gridCellEndIndices[g_idx] = N - 1;
+        return;
+    }
+
+    // a transition between this grid index and the next marks both the end
+    // of this cell's span and the start of the next cell's span; note that
+    // thread 0 must also run this check, or the 0 -> 1 transition is missed
+    int g_idx_nxt = particleGridIndices[index + 1];
+    if (g_idx_nxt != g_idx) {
+        gridCellEndIndices[g_idx] = index;
+        gridCellStartIndices[g_idx_nxt] = index + 1;
+    }
+}
+
+__global__ void kernSortPosVel(int N, int *indices, glm::vec3 *pos, glm::vec3 *pos_new, glm::vec3 *vel1, glm::vec3 *vel1_new) {
+    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (idx >= N) {
+        return;
+    }
+
+    // gather into the coherent buffers; the caller swaps the buffer pointers
+    // afterwards, since writing back to pos/vel1 in the same launch would
+    // race with reads from other blocks (__syncthreads only syncs one block)
+    int i = indices[idx];
+    pos_new[idx] = pos[i];
+    vel1_new[idx] = vel1[i];
}

__global__ void kernUpdateVelNeighborSearchScattered(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  int *particleArrayIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+    int N, int gridResolution, glm::vec3 gridMin,
+    float inverseCellWidth, float cellWidth,
+    int *gridCellStartIndices, int *gridCellEndIndices,
+    int *particleArrayIndices,
+    glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
+    // the number of boids that need to be checked.
+    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (idx >= N) {
+        return;
+    }
+
+    // - Identify the grid cell that this particle is in
+    int p_idx;
+    int cx, cy, cz;
+    int neighbors_cohesion, neighbors_alignment;
+
+    glm::vec3 temp;
+    glm::vec3 p_this_pos;
+    glm::vec3 vel_cohesion(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_separation(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_alignment(0.0f, 0.0f, 0.0f);
+
+    neighbors_cohesion = 0;
+    neighbors_alignment = 0;
+    p_idx = particleArrayIndices[idx];
+    p_this_pos = pos[p_idx];
+
+    temp = (pos[p_idx] - gridMin) * inverseCellWidth;
+    cx = (int)((temp.x - (int)temp.x) < 0.5f ? temp.x - 1 : temp.x); // leftmost cell of the 2x2x2 neighborhood
+    cy = (int)((temp.y - (int)temp.y) < 0.5f ? temp.y - 1 : temp.y);
+    cz = (int)((temp.z - (int)temp.z) < 0.5f ? temp.z - 1 : temp.z);
+
+    // - Identify which cells may contain neighbors. This isn't always 8.
+    // - For each cell, read the start/end indices in the boid pointer array.
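+    // The half-cell test above snaps to the cell corner nearest the boid:
+    // because cellWidth is twice the largest rule distance, the entire
+    // search radius fits inside the 2x2x2 block of cells sharing that
+    // corner, so only those eight cells (clamped to the grid) are scanned.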
+
+    // - Access each boid in the cell and compute velocity change from
+    //   the boids rules, if this boid is within the neighborhood distance.
+    for (int i = cx; i < cx + 2; i++) {
+        for (int j = cy; j < cy + 2; j++) {
+            for (int k = cz; k < cz + 2; k++) {
+                if (i < 0 || i > gridResolution - 1 ||
+                    j < 0 || j > gridResolution - 1 ||
+                    k < 0 || k > gridResolution - 1) continue;
+
+                int g_idx = gridIndex3Dto1D(i, j, k, gridResolution);
+                int p_start_idx = gridCellStartIndices[g_idx];
+                int p_end_idx = gridCellEndIndices[g_idx];
+
+                if (p_start_idx < 0 && p_end_idx < 0) continue;
+
+                for (int p_nbr_idx = p_start_idx; p_nbr_idx <= p_end_idx; p_nbr_idx++) {
+                    int p_nbr_i = particleArrayIndices[p_nbr_idx];
+
+                    if (p_nbr_i != p_idx) {
+                        glm::vec3 p_nbr_pos = pos[p_nbr_i];
+                        glm::vec3 p_nbr_vel = vel1[p_nbr_i];
+                        float len = glm::length(p_nbr_pos - p_this_pos);
+
+                        if (len < rule1Distance) {
+                            vel_cohesion += p_nbr_pos;
+                            neighbors_cohesion++;
+                        }
+
+                        if (len < rule2Distance) {
+                            vel_separation -= (p_nbr_pos - p_this_pos);
+                        }
+
+                        if (len < rule3Distance) {
+                            vel_alignment += p_nbr_vel;
+                            neighbors_alignment++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    if (neighbors_cohesion) {
+        vel_cohesion /= neighbors_cohesion;
+        vel_cohesion = (vel_cohesion - p_this_pos) * rule1Scale;
+    }
+
+    vel_separation *= rule2Scale;
+
+    if (neighbors_alignment) {
+        vel_alignment /= neighbors_alignment;
+        vel_alignment *= rule3Scale;
+    }
+
+    // - Clamp the speed change before putting the new speed in vel2
+    glm::vec3 v = vel_cohesion + vel_separation + vel_alignment + vel1[p_idx];
+    float speed = glm::length(v);
+
+    if (speed > maxSpeed) {
+        v = v * maxSpeed / speed;
+    }
+
+    vel2[p_idx] = v;
}

__global__ void kernUpdateVelNeighborSearchCoherent(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+    int N, int gridResolution, glm::vec3 gridMin,
+    float inverseCellWidth, float cellWidth,
+    int *gridCellStartIndices, int *gridCellEndIndices,
+    glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+    // except with one less level of indirection.
+    // This should expect gridCellStartIndices and gridCellEndIndices to refer
+    // directly to pos and vel1.
+    // - Identify the grid cell that this particle is in
+    // - Identify which cells may contain neighbors. This isn't always 8.
+    // - For each cell, read the start/end indices in the boid pointer array.
+    //   DIFFERENCE: For best results, consider what order the cells should be
+    //   checked in to maximize the memory benefits of reordering the boids data.
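+    //   (gridIndex3Dto1D varies fastest in x, so iterating x in the
+    //   innermost loop would visit cell spans in increasing memory order;
+    //   the loops below keep the same i/j/k order as the scattered kernel.)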
+    // - Access each boid in the cell and compute velocity change from
+    //   the boids rules, if this boid is within the neighborhood distance.
+    // - Clamp the speed change before putting the new speed in vel2
+
+    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (idx >= N) {
+        return;
+    }
+
+    // - Identify the grid cell that this particle is in
+    int p_idx;
+    int cx, cy, cz;
+    int neighbors_cohesion, neighbors_alignment;
+    glm::vec3 temp;
+    glm::vec3 p_this_pos;
+    glm::vec3 vel_cohesion(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_separation(0.0f, 0.0f, 0.0f);
+    glm::vec3 vel_alignment(0.0f, 0.0f, 0.0f);
+
+    p_idx = idx; // pos and vel1 are already sorted by cell, so no indirection is needed
+    temp = (pos[p_idx] - gridMin) * inverseCellWidth;
+    cx = (int)((temp.x - (int)temp.x) < 0.5f ? temp.x - 1 : temp.x); // leftmost cell of the 2x2x2 neighborhood
+    cy = (int)((temp.y - (int)temp.y) < 0.5f ? temp.y - 1 : temp.y);
+    cz = (int)((temp.z - (int)temp.z) < 0.5f ? temp.z - 1 : temp.z);
+
+    // - Identify which cells may contain neighbors. This isn't always 8.
+    // - For each cell, read the start/end indices in the boid pointer array.
+
+    neighbors_cohesion = 0;
+    neighbors_alignment = 0;
+    p_this_pos = pos[p_idx];
+
+    for (int i = cx; i < cx + 2; i++) {
+        for (int j = cy; j < cy + 2; j++) {
+            for (int k = cz; k < cz + 2; k++) {
+                if (i < 0 || i > gridResolution - 1 ||
+                    j < 0 || j > gridResolution - 1 ||
+                    k < 0 || k > gridResolution - 1) continue;
+
+                int g_idx = gridIndex3Dto1D(i, j, k, gridResolution);
+                int p_start_idx = gridCellStartIndices[g_idx];
+                int p_end_idx = gridCellEndIndices[g_idx];
+
+                if (p_start_idx < 0 && p_end_idx < 0) continue;
+
+                for (int p_nbr_idx = p_start_idx; p_nbr_idx <= p_end_idx; p_nbr_idx++) {
+                    int p_nbr_i = p_nbr_idx;
+
+                    if (p_nbr_i != p_idx) {
+                        glm::vec3 p_nbr_pos = pos[p_nbr_i];
+                        glm::vec3 p_nbr_vel = vel1[p_nbr_i];
+                        float len = glm::length(p_nbr_pos - p_this_pos);
+
+                        if (len < rule1Distance) {
+                            vel_cohesion += p_nbr_pos;
+                            neighbors_cohesion++;
+                        }
+
+                        if (len < rule2Distance) {
+                            vel_separation -= (p_nbr_pos - p_this_pos);
+                        }
+
+                        if (len < rule3Distance) {
+                            vel_alignment += p_nbr_vel;
+                            neighbors_alignment++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    if (neighbors_cohesion) {
+        vel_cohesion /= neighbors_cohesion;
+        vel_cohesion = (vel_cohesion - p_this_pos) * rule1Scale;
+    }
+
+    vel_separation *= rule2Scale;
+
+    if (neighbors_alignment) {
+        vel_alignment /= neighbors_alignment;
+        vel_alignment *= rule3Scale;
+    }
+
+    glm::vec3 v = vel_cohesion + vel_separation + vel_alignment + vel1[p_idx];
+    float speed = glm::length(v);
+
+    if (speed > maxSpeed) {
+        v = v * maxSpeed / speed;
+    }
+
+    vel2[p_idx] = v;
}

/**
* Step the entire N-body simulation by `dt` seconds.
*/
void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+    // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+    // TODO-1.2 ping-pong the velocity buffers
+
+    dim3 BlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+    kernUpdateVelocityBruteForce << <BlocksPerGrid, blockSize >> >(numObjects, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos << <BlocksPerGrid, blockSize >> >(numObjects, dt, dev_pos, dev_vel2);
+
+    std::swap(dev_vel1, dev_vel2);
}

void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+    // TODO-2.1
+    // Uniform Grid Neighbor search using Thrust sort.
+    // In Parallel:
+    // - label each particle with its array index as well as its grid index.
+    //   Use 2x width grids.
+    // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+    //   are welcome to do a performance comparison.
+    // - Naively unroll the loop for finding the start and end indices of each
+    //   cell's data pointers in the array of boid indices
+    // - Perform velocity updates using neighbor search
+    // - Update positions
+    // - Ping-pong buffers as needed
+
+    dim3 BlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+    kernComputeIndices << <BlocksPerGrid, blockSize >> >(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    dim3 BlocksPerGridForCells((gridCellCount + blockSize - 1) / blockSize);
+    kernResetIntBuffer << <BlocksPerGridForCells, blockSize >> >(gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <BlocksPerGridForCells, blockSize >> >(gridCellCount, dev_gridCellEndIndices, -1);
+
+    kernIdentifyCellStartEnd << <BlocksPerGrid, blockSize >> >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+    kernUpdateVelNeighborSearchScattered << <BlocksPerGrid, blockSize >> >(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+        dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+    kernUpdatePos << <BlocksPerGrid, blockSize >> >(numObjects, dt, dev_pos, dev_vel2);
+
+    std::swap(dev_vel1, dev_vel2);
}

void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
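+    // Pipeline: label boids with cell indices -> sort the index buffer by
+    // cell -> find each cell's start/end -> gather pos/vel1 into coherent
+    // order with kernSortPosVel -> neighbor search with no
+    // particleArrayIndices indirection -> integrate -> ping-pong velocities.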
+
+    dim3 BlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+    kernComputeIndices<<<BlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    //dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+    //thrust::device_ptr<glm::vec3> dev_thrust_pos(dev_pos);
+    //thrust::device_ptr<glm::vec3> dev_thrust_vel1(dev_vel1);
+
+    //thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, thrust::make_zip_iterator(thrust::make_tuple(dev_thrust_pos, dev_thrust_vel1)));
+
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    dim3 BlocksPerGridForCells((gridCellCount + blockSize - 1) / blockSize);
+    kernResetIntBuffer << <BlocksPerGridForCells, blockSize >> >(gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <BlocksPerGridForCells, blockSize >> >(gridCellCount, dev_gridCellEndIndices, -1);
+
+    kernIdentifyCellStartEnd << <BlocksPerGrid, blockSize >> >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    kernSortPosVel << <BlocksPerGrid, blockSize >> >(numObjects, dev_particleArrayIndices, dev_pos, dev_pos_new, dev_vel1, dev_vel1_new);
+
+    // the gathered buffers are the coherent ones; swap them in rather than
+    // copying back inside kernSortPosVel, which would race across blocks
+    std::swap(dev_pos, dev_pos_new);
+    std::swap(dev_vel1, dev_vel1_new);
+
+    kernUpdateVelNeighborSearchCoherent << <BlocksPerGrid, blockSize >> >(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2);
+
+    kernUpdatePos << <BlocksPerGrid, blockSize >> >(numObjects, dt, dev_pos, dev_vel2);
+
+    std::swap(dev_vel1, dev_vel2);
}

void Boids::endSimulation() {
-  cudaFree(dev_vel1);
-  cudaFree(dev_vel2);
-  cudaFree(dev_pos);
+    cudaFree(dev_vel1);
+    cudaFree(dev_vel2);
+    cudaFree(dev_pos);
+
+    cudaFree(dev_particleArrayIndices);
+    cudaFree(dev_particleGridIndices);
+    cudaFree(dev_gridCellStartIndices);
+    cudaFree(dev_gridCellEndIndices);
+
+    cudaFree(dev_pos_new);
+    cudaFree(dev_vel1_new);

-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+    // TODO-2.1 TODO-2.3 - Free any additional buffers here.
}

void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
-  // test unstable sort
-  int *dev_intKeys;
-  int *dev_intValues;
-  int N = 10;
-
-  int *intKeys = new int[N];
-  int *intValues = new int[N];
-
-  intKeys[0] = 0; intValues[0] = 0;
-  intKeys[1] = 1; intValues[1] = 1;
-  intKeys[2] = 0; intValues[2] = 2;
-  intKeys[3] = 3; intValues[3] = 3;
-  intKeys[4] = 0; intValues[4] = 4;
-  intKeys[5] = 2; intValues[5] = 5;
-  intKeys[6] = 2; intValues[6] = 6;
-  intKeys[7] = 0; intValues[7] = 7;
-  intKeys[8] = 5; intValues[8] = 8;
-  intKeys[9] = 6; intValues[9] = 9;
-
-  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
-
-  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
-
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  std::cout << "before unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // How to copy data to the GPU
-  cudaMemcpy(dev_intKeys, intKeys, sizeof(int) * N, cudaMemcpyHostToDevice);
-  cudaMemcpy(dev_intValues, intValues, sizeof(int) * N, cudaMemcpyHostToDevice);
-
-  // Wrap device vectors in thrust iterators for use with thrust.
-  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
-  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
-  // LOOK-2.1 Example for using thrust::sort_by_key
-  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
-
-  // How to copy data back to the CPU side from the GPU
-  cudaMemcpy(intKeys, dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  cudaMemcpy(intValues, dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  checkCUDAErrorWithLine("memcpy back failed!");
-
-  std::cout << "after unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // cleanup
-  delete[] intKeys;
-  delete[] intValues;
-  cudaFree(dev_intKeys);
-  cudaFree(dev_intValues);
-  checkCUDAErrorWithLine("cudaFree failed!");
-  return;
+    // LOOK-1.2 Feel free to write additional tests here.
+
+    // test unstable sort
+    int *dev_intKeys;
+    int *dev_intValues;
+    int *dev_intValues1;
+    int N = 10;
+
+    int *intKeys = new int[N];
+    int *intValues = new int[N];
+    int *intValues1 = new int[N];
+
+    intKeys[0] = 0; intValues[0] = 0;
+    intKeys[1] = 1; intValues[1] = 1;
+    intKeys[2] = 0; intValues[2] = 2;
+    intKeys[3] = 3; intValues[3] = 3;
+    intKeys[4] = 0; intValues[4] = 4;
+    intKeys[5] = 2; intValues[5] = 5;
+    intKeys[6] = 2; intValues[6] = 6;
+    intKeys[7] = 0; intValues[7] = 7;
+    intKeys[8] = 5; intValues[8] = 8;
+    intKeys[9] = 6; intValues[9] = 9;
+
+    intValues1[0] = 9;
+    intValues1[1] = 8;
+    intValues1[2] = 7;
+    intValues1[3] = 6;
+    intValues1[4] = 5;
+    intValues1[5] = 4;
+    intValues1[6] = 3;
+    intValues1[7] = 2;
+    intValues1[8] = 1;
+    intValues1[9] = 0;
+
+    cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
+    checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
+
+    cudaMalloc((void**)&dev_intValues, N * sizeof(int));
+    checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
+
+    cudaMalloc((void**)&dev_intValues1, N * sizeof(int));
+    checkCUDAErrorWithLine("cudaMalloc dev_intValues1 failed!");
+
+    dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+    std::cout << "before unstable sort: " << std::endl;
+    for (int i = 0; i < N; i++) {
+        std::cout << "  key: " << intKeys[i];
+        std::cout << " value: " << intValues[i];
+        std::cout << " value1: " << intValues1[i] << std::endl;
+    }
+
+    // How to copy data to the GPU
+    cudaMemcpy(dev_intKeys, intKeys, sizeof(int) * N, cudaMemcpyHostToDevice);
+    cudaMemcpy(dev_intValues, intValues, sizeof(int) * N, cudaMemcpyHostToDevice);
+    cudaMemcpy(dev_intValues1, intValues1, sizeof(int) * N, cudaMemcpyHostToDevice);
+
+    // Wrap device vectors in thrust iterators for use with thrust.
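+    // The zip iterator below makes thrust sort both value arrays by the
+    // same key in lockstep; the commented-out code in
+    // stepSimulationCoherentGrid sketches the same idea for sorting pos
+    // and vel1 by grid cell directly.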
+    thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
+    thrust::device_ptr<int> dev_thrust_values(dev_intValues);
+    thrust::device_ptr<int> dev_thrust_values1(dev_intValues1);
+    // LOOK-2.1 Example for using thrust::sort_by_key
+    //thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
+
+    thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, thrust::make_zip_iterator(thrust::make_tuple(dev_thrust_values, dev_thrust_values1)));
+
+    // How to copy data back to the CPU side from the GPU
+    cudaMemcpy(intKeys, dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
+    cudaMemcpy(intValues, dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
+    cudaMemcpy(intValues1, dev_intValues1, sizeof(int) * N, cudaMemcpyDeviceToHost);
+    checkCUDAErrorWithLine("memcpy back failed!");
+
+    std::cout << "after unstable sort: " << std::endl;
+    for (int i = 0; i < N; i++) {
+        std::cout << "  key: " << intKeys[i];
+        std::cout << " value: " << intValues[i];
+        std::cout << " value1: " << intValues1[i] << std::endl;
+    }
+
+    // cleanup
+    delete[] intKeys;
+    delete[] intValues;
+    delete[] intValues1;
+    cudaFree(dev_intKeys);
+    cudaFree(dev_intValues);
+    cudaFree(dev_intValues1);
+    checkCUDAErrorWithLine("cudaFree failed!");
+    return;
}
diff --git a/src/main.cpp b/src/main.cpp
index a29471d..e3b6949 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,25 +7,30 @@
*/

#include "main.hpp"
-
+#include <fstream>
// ================
// Configuration
// ================

// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
#define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
#define COHERENT_GRID 0
+#define MEASURE_KERNEL_TIME 0
+#define BENCHMARK 0

// LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
const float DT = 0.2f;

+float ktime = -1;
+std::ofstream outputFile;
+
/**
* C main function.
*/
int main(int argc, char* argv[]) {
-  projectName = "565 CUDA Intro: Boids";
+  projectName = "Ziyu Li - Project 1 - CUDA: Boids";

  if (init(argc, argv)) {
    mainLoop();
@@ -67,6 +72,9 @@ bool init(int argc, char **argv) {
  ss << projectName << " [SM " << major << "." << minor << " " << deviceProp.name << "]";
  deviceName = ss.str();

+  // Write Benchmark to file
+  outputFile.open("benchmark.txt");
+
  // Window setup stuff
  glfwSetErrorCallback(errorCallback);

@@ -198,6 +206,14 @@ void initShaders(GLuint * program) {
  cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions);
  cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities);

+#if MEASURE_KERNEL_TIME
+  ktime = 0.0f;
+  cudaEvent_t start, stop;
+  cudaEventCreate(&start);
+  cudaEventCreate(&stop);
+  cudaEventRecord(start, 0);
+#endif
+
  // execute the kernel
#if UNIFORM_GRID && COHERENT_GRID
  Boids::stepSimulationCoherentGrid(DT);
@@ -210,6 +226,15 @@
#if VISUALIZE
  Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
#endif
+
+#if MEASURE_KERNEL_TIME
+  cudaEventRecord(stop, 0);
+  cudaEventSynchronize(stop);
+  cudaEventElapsedTime(&ktime, start, stop);
+  cudaEventDestroy(start);
+  cudaEventDestroy(stop);
+#endif
+
  // unmap buffer object
  cudaGLUnmapBufferObject(boidVBO_positions);
  cudaGLUnmapBufferObject(boidVBO_velocities);
@@ -220,6 +245,10 @@
  double timebase = 0;
  int frame = 0;

+  double avgKernelTime = 0; // Kernel Time Measurement
+  double sumKernelTime = 0;
+  int oldtime = -1;
+
  Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                     // your CUDA development setup is ready to go.
@@ -228,20 +257,37 @@
    frame++;
    double time = glfwGetTime();

+    sumKernelTime += (double)ktime;

    if (time - timebase > 1.0) {
+      avgKernelTime = sumKernelTime / frame;
      fps = frame / (time - timebase);
      timebase = time;
      frame = 0;
+      sumKernelTime = 0;
    }

+#if BENCHMARK
+    if (oldtime != (int)time) {
+      if ((int)time) {
+        //outputFile << (int)time << "," << fps << ", " << avgKernelTime << std::endl;
+        outputFile << fps << std::endl;
+      }
+      else {
+        //outputFile << "Run" << "," << "FPS" << ", " << "Kernel Time" << std::endl;
+      }
+      oldtime++;
+    }
+#endif

    runCUDA();

    std::ostringstream ss;
    ss << "[";
-    ss.precision(1);
+    ss.precision(3);
    ss << std::fixed << fps;
-    ss << " fps] " << deviceName;
+    ss << " fps] ";
+    ss << " [kernel time: " << avgKernelTime << "ms] " << deviceName;
    glfwSetWindowTitle(window, ss.str().c_str());

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);