Wednesday, March 5, 2014

CUDA for 3D Collision Detection

Having worked on a fairly simple physics engine in C++ for the past few months, I was starting to reach the limit of my CPU in terms of the number of simulated bodies, which turned out to be around 150. I began thinking there might be a better way to detect and resolve contacts than serially on the CPU with O(N*(N-1)/2) pairwise checks. I decided to look into offloading contact detection from the CPU onto the GPU using CUDA. I initially got the idea after reading Daniel Houghton's blog post about Compute Shaders in DirectX, but thought I could do more than just simulate the particle positions on the GPU.

To begin with, I thought it would be valuable to benchmark performance with different numbers of simulated bodies:
Contact Detection   Contact Resolution   FPS (200 bodies)   FPS (1500 bodies)
CPU                 CPU                         83                 2-3
CPU                 GPU (CUDA)                  98                 4-5
GPU (CUDA)          CPU                        165                24-25
GPU (CUDA)          GPU (CUDA)                 202                27-28

As can be seen, using CUDA dramatically improved performance and allowed for 10x as many particles without visually ruining the simulation.

To start off with, you'll need to install the CUDA SDK and Toolkit. I'm using version 6.0, which is available to members of the NVIDIA developer program; you can sign up and join for free. All of the functionality needed in this demo, however, is available with CUDA 5.5.

Next, you can either create a new CUDA project or add the functionality to an existing project. An overview of how to integrate CUDA into an existing project, or create a new CUDA project, can be found here for Visual Studio.

The first thing you need to understand about CUDA, if you didn't already, is that it provides parallel computing on the GPU, as opposed to the standard serial computing that runs on a single CPU. What this means is that CUDA excels at things where you might use a foreach loop and where each element doesn't rely on another element's output. Think of taking in an array of integers and getting the square root of each: finding the root of the 5th element has no bearing on the result of the 2nd or 9th. There are ways to share data between threads and blocks, but if at all possible, your CUDA kernel should not rely on other executions. You can assign how many blocks of the GPU you wish to use and how many threads each block should consist of. This allows you to scale GPU usage up and down based on how many executions you expect to need.

The second most important thing to know is that GPU memory and CPU memory are distinct entities (unless you are using Unified Memory, available in CUDA 6.0 on GPUs of compute capability 3.0+). This means that when you want to pass data back and forth, you need to allocate memory on the GPU, copy the data over, and free the memory afterwards. This is accomplished through the cudaMalloc, cudaMemcpy and cudaFree calls. Because the CPU and GPU memory segments are distinct, you cannot pass pointers to CPU memory to the GPU, and vice versa, as they cannot access each other's memory.

Now that a basic understanding of CUDA has been established, on to some examples.

A function that is called from the CPU but executes on the GPU is called a kernel. Kernels are signified by __global__ at the beginning of the function declaration:

__global__ void findResolveContacts(PhysicsParticle* particles, int* contactIndex, int particleCount, float deltaTime)

This is the declaration for my kernel, which finds and resolves all physics contacts between particles in my engine. It takes in an array of PhysicsParticle objects, an int pointer that is shared between all blocks and threads, the number of particles, and the time delta for the frame.

Before calling the kernel, we need to allocate some device memory and copy over our particle data from CPU to GPU memory.

PhysicsParticle* d_a;
cudaMalloc((void **)&d_a, particleCount*sizeof(PhysicsParticle));
cudaMemcpy(d_a, particles, particleCount*sizeof(PhysicsParticle), cudaMemcpyHostToDevice);

int *contactIndex, *d_contactIndex;
contactIndex = (int*)malloc(sizeof(int));
*contactIndex = 0;
cudaMalloc((void **)&d_contactIndex, sizeof(int));
cudaMemcpy(d_contactIndex, contactIndex, sizeof(int), cudaMemcpyHostToDevice);

The CUDA versions of each call are standard, other than memcpy, which also requires the type of copy occurring to be specified. The two we will use are cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, which are fairly self-explanatory as to when they are used.

With all of our device memory assigned, we can now call our kernel. It is launched with

findResolveContacts<<<(particleCount + 512-1)/512,512>>>(d_a, d_contactIndex, particleCount, deltaTime);

The <<< >>> brackets indicate that a kernel is being launched on the GPU and contain the number of blocks and the threads per block: if n is the number of executions required and m is the number of threads per block, then (n + m-1)/m is the number of blocks. In this case I am using 512 threads per block with a variable particleCount. Following the >>> are the arguments, which must be either pointers to device memory (signified here with the d_ prefix) or values passed directly. If we were to simulate 1024 particles, for example, we would have 2 blocks with 512 threads each.

To actually achieve functionality, we not only need to execute the kernel the correct number of times, but also need each execution to know which part of the data it should work on. We achieve this by computing the thread id:

int i = threadIdx.x + blockIdx.x * blockDim.x;

threadIdx, blockIdx and blockDim are all values known to the kernel and can be used to determine exactly which thread the kernel is running in. threadIdx is the offset of the thread within the block, blockIdx is the offset of the block the thread is in, and blockDim is the size of each block. Thus with 1024 particles, we will have a range of ids from 0 to 1023, which exactly correlates with the array of PhysicsParticles we fed in.

Before we access the array, we need a quick check to make sure we aren't in a thread beyond the length of the array:

if(i < particleCount && i >= 0)

Then we can begin looking for contacts between the particle at particles[i] and every other element of particles. Because this is a simple particle physics engine, we are only dealing with sphere collisions.

Vector3 iPos = particles[i].GetPosition();
for(int j = i + 1; j < particleCount; ++j)
{
    Vector3 jPos = particles[j].GetPosition();
    Vector3 direction;
    direction.x = iPos.x - jPos.x;
    direction.y = iPos.y - jPos.y;
    direction.z = iPos.z - jPos.z;
    float distance = direction.length();

Another important aspect to mention here is that any time you want a function to be callable from a kernel, it must be prefaced with __device__. If you want to be able to call it from both the kernel and the CPU, you must preface it with __device__ __host__. I set j to i+1 so that two executions don't both resolve the same contact: each execution checks for contacts only with particles at higher indices.

if(distance <= (particles[i].GetRadius() + particles[j].GetRadius()))
{
    if(*contactIndex < LIMIT)
    {
        ParticleContact contact;
        contact.first = i;
        contact.second = j;
        int index = atomicAdd(contactIndex, 1);

This section finds all pairs of particles whose separation is less than or equal to the sum of their radii, and which are therefore in contact or interpenetrating. I mentioned earlier that there may be cases where you do need multiple executions to reference each other or otherwise share live data. The solution comes in the form of a series of atomic operations. Each atomic operation puts a lock on the pointer, performs the operation, then unlocks it. Atomic operations generally return the value at the pointer prior to the change.

Vector3 contactNormal = direction.normalized();
float penetration =  (particles[contact.first].GetRadius() + particles[contact.second].GetRadius()) - distance;
ResolveVelocity(particles[contact.first], particles[contact.second], contactNormal, deltaTime);
ResolveInterpenetration(particles[contact.first], particles[contact.second], contactNormal, penetration, deltaTime);

Finally, I call the __device__ functions that take in the two contacting particles and resolve their velocities and interpenetration. It is important to remember that pointers can be used within __device__ functions when they point to device memory; thus my ResolveVelocity and ResolveInterpenetration take each PhysicsParticle by reference, so that the values for position and velocity can be changed.

We will need to wait until every execution is complete before we proceed in the serial code on the CPU, so after we call findResolveContacts, we call

cudaDeviceSynchronize();

Synchronizing blocks the CPU until every device execution is complete. Once we are sure every contact has been found and resolved, we must copy the altered PhysicsParticle array back from device to host memory.

cudaMemcpy(particles, d_a, particleCount*sizeof(PhysicsParticle), cudaMemcpyDeviceToHost);

With that, we can free all the memory we used:

free(contactIndex);
cudaFree(d_a);
cudaFree(d_contactIndex);

and return the particle array so it can be copied back into the particle pointers used everywhere else in the engine (here particles is the engine-side container of particle pointers, and particleArray is the array returned from the CUDA step).

i = 0;
for(auto it = particles.begin(); it != particles.end(); ++it)
{
    (*it)->SetPosition(particleArray[i].GetPosition());
    (*it)->SetVelocity(particleArray[i].GetVelocity());
    ++i;
}

With that, we have a physics particle system that runs on the GPU using CUDA many times faster than it did running serially on the CPU. If you have any questions or comments, feel free to leave them below.


Thursday, February 27, 2014

Rest API Based Game

Most multiplayer games rely on an active, constant network connection between all players. While this is nearly mandatory for games where quick responses and active play are crucial, not all games have such requirements. Games such as Civilization and other turn-based games often offer play over email and similar methods. This isn't always optimal, however, and there are other ways to facilitate turn-based play. I explored using a REST API to store the game state and inform clients of it. I used a WAMP stack server and Unity for the actual game.

The first step in my exploration was how to keep track of each player. Normally with a multiplayer game, the client is connected to the server for the duration of the game, meaning that the server can assign an identifier to that client connection. With a non-continuous connection, I needed a way to identify a player between sessions. I decided that creating a simple account and login system would be best.

I won't go into the details of the login system, as that is not the focus of the project.

Once a player is logged in, the client polls the REST API for the current game state with a GET, which the server outputs in XML format:
while($row = mysqli_fetch_array($result))
{
   //XML format info about player
   echo "
   <playerpos>
      <char>".$row['name']."</char>
      <x>".$row['x']."</x>
      <y>".$row['y']."</y>
      <class>".$row['class']."</class>
      <health>".$row['health']."</health>
   </playerpos>";
}

With the game state, the client assembles the game world and displays it to the player. Part of the game state information is also whose turn it currently is. When the player makes their move, the client sends a POST to the server with the move, which is processed server side to determine the new game state.

This method doesn't rely on players being connected simultaneously and allows a game session to scale further, since the server isn't managing a large number of simultaneous connections.

You can check out the simple game here.

Sunday, February 23, 2014

Production

Green Mountain Games Fest

Yesterday Phlegmatic Studios exhibited RoboRuckus, as the game is now called, to hundreds of people at the first Green Mountain Games Fest at Champlain College. It proved to be a resounding success, with nearly everyone who saw the game giving positive feedback and wanting to see where we take the tech.

We were also privileged to have Rami Ismail of Vlambeer come by as well, and give us some valuable feedback. With the success and positive feedback we received from everyone at Green Mountain, we are extremely excited about where Phlegmatic is going and where we can take RoboRuckus.

Tuesday, February 4, 2014

Solar System Simulation

Creating a realistic simulation of a solar system has proven to be an interesting project. It seemed simple enough at first, just find out the orbital distance and speed of each planet, apply those to a particle, add the force of gravity from another particle acting as the sun, and you'll have a good simulation, right? Not quite.

First of all, what kind of units do you use? Most sources give distances in kilometers and masses in kilograms for nearly every quantity, and the numbers get big fast. It's hard to work with values that sit twenty orders of magnitude above or below one. How to solve the unit issue? Find units that balance all the planets and the sun, so that nothing is too large or too small. Gigameters turned out to be a good solution for distance: six orders of magnitude larger than kilometers, they significantly improve the readability of the numbers, bringing orbital distances into the hundreds of Gm rather than the hundreds of millions of km. Mass, on the other hand, is a much larger beast. The difference in order of magnitude between the sun and the lowest-mass planet, Mercury, is about 7. To handle this, I found that yottagrams (10^24 grams) are a viable unit for mass. While it means the sun is still measured in the billions of Yg, every other body in the system has a much smaller magnitude.

So, we have our units, Gigameters and Yottagrams. We put in all the data using the correct units and our planets are instantly pulled to the center.


Why would that happen? We used all the correct units for our planets, we used the actual gravitational constant, everything should work! Wait: the gravitational constant has units. We never converted G from km^3/(kg s^2) to our new units of Gm^3/(Yg s^2). That would be our problem! The magnitude of G is off by several digits. Wolfram Alpha is very useful for things like this, meaning we don't need to worry about making a conversion mistake, as long as we ask for the correct units (check it out).

With that out of the way, all our units are in accord and our gravity should work, so let's see how it turns out.

After a little more playing around, I found that the moon does orbit the earth properly, but only when the "warp" is extremely high.

Friday, January 24, 2014

Production

Browser Controller

I have successfully implemented the browser controller, which currently works on most Android and iOS devices. The system uses WebSockets to connect to the server and then feeds touch data through the socket connection. The interface as it stands allows for a large amount of future customization and alteration.

The current control scheme is a slight alteration of the twin-stick scheme available through the Android app, with the caveat that the center of each stick is set to the position where the touch input begins. A new scheme that is planned, but not yet implemented, is a somewhat gesture-based input, where input is based not on the delta from the center of the joystick (where the touch began) but on the delta between touchmove events.

Improving Character Motion

Another aspect of the input that we are working on is the slipperiness of the character when moving. The character currently moves based on forces applied to a rigidbody that is part of the character, which is an effective way of maintaining realistic interactions between the player and other objects. The downside, however, is that because motion is created by force applied to the unit, it slides rather than immediately changing direction when input changes.

Saturday, January 18, 2014

Production

Implementing an alternative control interface

One of the major goals over the last semester was improving our control interface. Seeing as we are using a mobile device as the control interface, it both limits and opens up a lot of possibilities. More important than the actual interface, though, is its accessibility. Until now, our interface has been a lightweight Unity app that can be downloaded and installed on Android devices. While this worked extremely well for the majority of people who downloaded the app to play the game, some users found that their device would not run the Unity application. We were also limited to Android only, as we don't currently have the facilities to build for iOS, and Unity for Windows Phone currently lacks the networking functionality we need. With these facts in mind, I set out to develop an alternative interface that would be OS independent and able to run on a wider range of devices than the Unity application. My solution? WebSockets.

Implementing WebSockets with Unity

When I started out, I didn't exactly understand the WebSocket protocol; I'd never used it before, let alone HTML5 in general. I soon came across two promising C# implementations of a WebSocket server that I could potentially run within Unity as an alternative to the Unity Networking we were currently using: Alchemy Websockets and WebSocket-Sharp. I quickly realized Alchemy would not be a viable solution, as it is built on .NET 4.0, which Unity does not support. WebSocket-Sharp, on the other hand, targets .NET 3.5 (Unity compatible) and even has a Unity Asset Store download available, which was a good indicator that it would work with Unity, so I set off with it.

This is more or less where I am now: WebSocket-Sharp is running successfully within a Unity project, accepting connections from a quick HTML/JavaScript page I wrote to connect to it, and easily passing information back and forth between clients and the server. That's basically all of the functionality I need to convert the input system of the game from Unity Networking to WebSockets.

That said, I have hit a few snags. For one, WebSocket-Sharp seems to leave a socket listener open (when running in the editor) no matter what I do to close it, which is a pretty big issue I need to solve before I move further towards implementing it in the full game. I'm not sure at this point whether it is an issue with how Unity handles the threads WebSocket-Sharp uses or an issue endemic to WebSocket-Sharp. Further investigation is required.

Friday, January 10, 2014

Production

Second Semester

After a break of a month, it's time for production to resume for the final 4 months. We've added 5 new team members to Phlegmatic Studios and are looking forward to working with them to improve our game as much as we can. We have a lot of ideas to discuss and decide upon and even more work to get done before the Senior Show in April.

Improvements

There are a few key areas that we, as a team, have decided we need to focus on first. The highest priority of these being the control scheme. Throughout the first few months of production, we mainly used a "twin-stick" configuration for controls, with two virtual joysticks as the method of input on the device, but as testing and feedback have shown time and time again: this isn't where we should be. Testers have found it lacking in many ways and desire something better. Our first task is to figure out what that "better" is and implement it as quickly and effectively as possible.

The next high priority area that we need to focus on is the aesthetic and theme of the world. As it stands right now, players are dropped in a pretty generic and empty feeling factory environment.


While it has worked for us up to this point, we have decided that we need to push it, either committing to the factory setting by making the level more like an assembly line, or taking it somewhere else entirely. The primary alternative idea at the moment is a Dr. Seuss feel, with comical-looking machinery and strange devices whirling and buzzing around the level.

With these two primary points of focus in mind, we resume our work with new team members providing fresh and interesting new ideas that will take our project to new heights.