Wednesday, 7 July 2021

Optimizing Grid Simulations with SIMD

In games we sometimes run into entity simulation problems that are O(n^2), such as dealing damage in a radius, collision detection, pathing behavior, etc. We can speed these up by using some kind of acceleration structure (grids, BVHs, quad/octrees, etc.), bringing the average cost down to O(n log n). When we use these acceleration structures, we inherently get branchier code, and it becomes harder to use SIMD to gain performance benefits. I set out to write a small demo that uses a grid to solve an O(n^2) problem while still getting good SIMD utilization. The problem I chose was simulating boids (https://eater.net/boids, Boids Video). 

The post first shows my naive implementation of boids, then how I optimized it with a grid data structure, then how I used SIMD to speed up the grid approach, with some final thoughts at the end. 

Boids

This post won't go into a lot of detail about how exactly boids work (the above link has more info + source), but the main idea is that each entity starts with a random position and velocity. Each entity checks a radius around itself and applies 3 rules to update its current velocity:

1) Try to align its velocity with the average velocity of entities around it

2) Try to move towards the center position of the entities around it

3) Avoid hitting other entities that are in its path

Boids are used to simulate things like flocks of birds or schools of fish. When implementing them, I double buffer the entities' positions and velocities so that when we apply the rules, we aren't reading data mixed from the last and current frames. The code ends up looking like the following:

struct bird
{
    v2 Vel;
    v2 Pos;
};

...

static b32 EvenFrame = true;

bird* PrevBirdArray = EvenFrame ? DemoState->PrevBirds : DemoState->CurrBirds;
bird* CurrBirdArray = EvenFrame ? DemoState->CurrBirds : DemoState->PrevBirds;
EvenFrame = !EvenFrame;

// NOTE: Update boids
for (u32 BoidId = 0; BoidId < DemoState->NumBoids; ++BoidId)
{
    v2 OldBoidPosition = PrevBirdArray[BoidId].Pos;
    v2 OldBoidVelocity = PrevBirdArray[BoidId].Vel;
    v2 NewBoidPosition = OldBoidPosition;
    v2 NewBoidVelocity = OldBoidVelocity;

    u32 NumBoidsInRadius = 0;
    v2 AvgFlockDir = {};
    v2 AvgFlockPos = {};
    v2 AvgFlockAvoidance = {};
    for (u32 NearbyBoidId = 0; NearbyBoidId < DemoState->NumBoids; ++NearbyBoidId)
    {
        if (NearbyBoidId != BoidId)
        {
            v2 DistanceVec = PrevBirdArray[NearbyBoidId].Pos - NewBoidPosition;
            f32 DistanceSq = LengthSquared(DistanceVec);

            if (DistanceSq < DemoState->BoidRadiusSq)
            {
                // NOTE: Calculate required averages
                ...
            }
        }
    }

    if (NumBoidsInRadius)
    {
        AvgFlockDir /= f32(NumBoidsInRadius);
        AvgFlockPos /= f32(NumBoidsInRadius);
    }
                    
    // NOTE: Apply rules
    ...
                    
    NewBoidPosition += NewBoidVelocity * FrameTime;
                    
    CurrBirdArray[BoidId].Vel = NewBoidVelocity;
    CurrBirdArray[BoidId].Pos = NewBoidPosition;
}

If we don't use an acceleration structure, these rules make the simulation O(n^2) because for each entity, we have to loop through all other entities, check if they are in radius, and apply the rules. The main observation is that if entities are roughly evenly distributed across the map, we don't actually care about most of them. Knowing this, we can use an acceleration structure to group entities based on their location and avoid having to check most other entities when applying our boid rules.

For reference, on my AMD Ryzen 7 2700X 3.70 GHz CPU, using the above code I was able to get to around ~400 entities at real time frame rates (this is all single core and will continue being single core for the rest of the post).

Grid Simulation

For this demo, I chose to use a grid since it seemed like a nice and easy fit for the problem (all entities are the same size and I'm expecting a fairly even distribution). I decided to rebuild the grid every frame instead of trying to update last frame's grid. The reason is that an entity can be in more than one grid cell at a time, so either the entity has to store references to the 4 potential grid cells it's inside plus the index into each grid cell's list of entities, or I would have to find the 4 grid cells based on the entity's position and iterate across all the indices to find where in the list the entity reference is in order to remove it. Both options seemed a lot less clean than what I wanted, and they are non-issues if the grid is rebuilt every frame. 

The grid API is the following:

struct grid_range
{
    u32 StartX;
    u32 StartY;
    u32 EndX;
    u32 EndY;
};

struct grid_cell
{
    u32 NumIndices;
    block_arena IndexArena;
};

struct grid
{
    aabb2 WorldBounds;
    u32 NumCellsX;
    u32 NumCellsY;
    u32 MaxNumIndicesPerBlock;
    grid_cell* Cells;
};

inline grid GridCreate(linear_arena* Arena, platform_block_arena* BlockArena, aabb2 WorldBounds, u32 NumCellsX, u32 NumCellsY);
inline u32 GridAddEntity(grid* Grid, v2 Position, u32 EntityId);
inline grid_range GridGetRange(grid* Grid, v2 Pos, f32 Radius);
inline void GridClear(grid* Grid);
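To make the API concrete, here is a hedged sketch of what GridGetRange could do internally. This is my own minimal guess, not the demo's actual implementation: the struct layouts are simplified, the function takes the grid fields directly instead of a grid pointer, and the clamp-to-grid behavior is an assumption.

```cpp
#include <cstdint>

typedef uint32_t u32;
typedef float f32;

struct v2 { f32 x, y; };
struct aabb2 { v2 Min, Max; };
struct grid_range { u32 StartX; u32 StartY; u32 EndX; u32 EndY; };

// NOTE: Hypothetical sketch; clamps so callers can loop Start..End inclusively
inline u32 WorldToCell(f32 X, f32 WorldMin, f32 CellSize, u32 NumCells)
{
    f32 Cell = (X - WorldMin) / CellSize;
    if (Cell < 0.0f) Cell = 0.0f;
    if (Cell > f32(NumCells - 1)) Cell = f32(NumCells - 1);
    return u32(Cell);
}

inline grid_range GridGetRangeSketch(aabb2 WorldBounds, u32 NumCellsX, u32 NumCellsY,
                                     v2 Pos, f32 Radius)
{
    f32 CellSizeX = (WorldBounds.Max.x - WorldBounds.Min.x) / f32(NumCellsX);
    f32 CellSizeY = (WorldBounds.Max.y - WorldBounds.Min.y) / f32(NumCellsY);

    // NOTE: Convert the query circle's world-space bounds to cell coordinates
    grid_range Result;
    Result.StartX = WorldToCell(Pos.x - Radius, WorldBounds.Min.x, CellSizeX, NumCellsX);
    Result.EndX   = WorldToCell(Pos.x + Radius, WorldBounds.Min.x, CellSizeX, NumCellsX);
    Result.StartY = WorldToCell(Pos.y - Radius, WorldBounds.Min.y, CellSizeY, NumCellsY);
    Result.EndY   = WorldToCell(Pos.y + Radius, WorldBounds.Min.y, CellSizeY, NumCellsY);
    return Result;
}
```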

From the structs, each grid cell stores some kind of array/list of the entity indices that are inside its world bounds, and a grid is just a collection of all the cells plus some metadata. The grid API is probably what you would expect, but I do use some special allocators. I have a linear allocator which I use to allocate the grid_cell array, but I also have what I call platform_block_arena and block_arena.

The block_arena is an arena that just allocates fixed-size blocks of memory and chains them in a linked list. The arena itself does not, however, call the OS for blocks of memory. Instead, it suballocates out of bigger blocks that are allocated by the platform_block_arena. The platform_block_arena allocates larger blocks that are given by the OS, and then various block arenas can suballocate out of it. I do this setup to try and minimize the # of OS calls to allocate memory (in practice, it's also nice because multiple systems can suballocate out of the same platform_block_arena). 

Looking at the grid cell index arena, it only allocates fixed-size blocks of memory, so what it ends up being is a linked list of fixed-size arrays of 32-bit unsigned integers, which are our indices in that cell. This linked-list-of-arrays setup is nice because it gives you the benefit of a resizable data structure while still being cache coherent up to your block size. You instead pay in wasted memory, since the block arena can only allocate fixed-size blocks. If memory is a concern, you can tune the fixed block size to have less dead space. 
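To illustrate the linked-list-of-blocks shape, here is a hedged sketch. The real block_arena suballocates its blocks from a platform_block_arena; this simplified version calls malloc directly and prepends new blocks to the chain, so treat it as a shape reference only, not the demo's implementation.

```cpp
#include <cstdint>
#include <cstdlib>

typedef uint32_t u32;

// NOTE: Hypothetical sketch of a block header; the payload (the fixed-size
// index array) follows the header in memory
struct block
{
    block* Next;
};

struct block_arena
{
    u32 BlockSize;   // NOTE: Payload bytes per block
    u32 UsedInBlock; // NOTE: Bytes used in the newest block
    block* Next;     // NOTE: Head of the block chain
};

inline u32* BlockArenaPushU32(block_arena* Arena, u32 Value)
{
    // NOTE: Start a new block when the current one is full (or none exists)
    if (!Arena->Next || Arena->UsedInBlock + sizeof(u32) > Arena->BlockSize)
    {
        block* NewBlock = (block*)malloc(sizeof(block) + Arena->BlockSize);
        NewBlock->Next = Arena->Next;
        Arena->Next = NewBlock;
        Arena->UsedInBlock = 0;
    }

    // NOTE: Payload starts right after the block header
    u32* Result = (u32*)((uint8_t*)(Arena->Next + 1) + Arena->UsedInBlock);
    *Result = Value;
    Arena->UsedInBlock += sizeof(u32);
    return Result;
}
```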

The code follows in 3 stages now: 

1) Add all birds/entities to the grid

2) Update all birds/entities in the grid

3) Clear the grid

// NOTE: Add all birds to grid data structure
u32* BirdGridIds = PushArray(&DemoState->TempArena, u32, DemoState->NumBirds);
{
    for (u32 BirdId = 0; BirdId < DemoState->NumBirds; ++BirdId)
    {
        bird* CurrBird = PrevBirdArray + BirdId;
        BirdGridIds[BirdId] = GridAddEntity(&DemoState->Grid, CurrBird->Pos, BirdId);
    }
}

We start step 1 by allocating an array of grid cell ids, one per entity, in a temporary arena that will be freed later. We then add all the entities to our grid.

// NOTE: Update birds
for (u32 BirdId = 0; BirdId < DemoState->NumBirds; ++BirdId)
{
    bird CurrBird = PrevBirdArray[BirdId];

    v2 NewBirdPosition = CurrBird.Pos;
    v2 NewBirdVelocity = CurrBird.Vel;

    u32 NumBirdsInRadius = 0;
    v2 AvgFlockDir = {};
    v2 AvgFlockPos = {};
    v2 AvgFlockAvoidance = {};

    grid* Grid = &DemoState->Grid;
    grid_range Range = GridGetRange(Grid, NewBirdPosition, Max(DemoState->BirdRadius, DemoState->AvoidRadius));
    for (u32 GridY = Range.StartY; GridY <= Range.EndY; ++GridY)
    {
        for (u32 GridX = Range.StartX; GridX <= Range.EndX; ++GridX)
        {
            grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

            // NOTE: Loop over all birds in the grid cell
            u32 GlobalIndexId = 0;
            for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
            {
                u32* BlockIndices = BlockGetData(CurrBlock, u32);
                u32 NumIndicesInBlock = Min(CurrCell->NumIndices - GlobalIndexId, Grid->MaxNumIndicesPerBlock);
                for (u32 IndexId = 0; IndexId < NumIndicesInBlock; ++IndexId, ++GlobalIndexId)
                {
                    u32 NearbyBirdId = BlockIndices[IndexId];
                                    
                    if (NearbyBirdId != BirdId)
                    {
                        bird* NearbyBird = PrevBirdArray + NearbyBirdId;

                        // NOTE: Find values we need for rules
                        ...
                     }
                }
            }
        }
    }
    
    if (NumBirdsInRadius)
    {
        AvgFlockDir /= f32(NumBirdsInRadius);
        AvgFlockPos /= f32(NumBirdsInRadius);
    }
                        
    // NOTE: Apply rules
    ...

    NewBirdPosition += NewBirdVelocity * FrameTime;

    // NOTE: Write into next bird array
    CurrBirdArray[BirdId].Vel = NewBirdVelocity;
    CurrBirdArray[BirdId].Pos = NewBirdPosition;
}

GridClear(Grid);

For updating each entity, we find the grid cells that intersect our specified radius and loop through their entities, getting the values we require for our 3 boid rules. You'll notice that because we have a linked list of arrays, the looping mechanism looks a bit more complex: it first steps through each block, and then through each entity in that block's array. We then clear the grid at the end, which frees all the memory that each block_arena held but does not clear the memory used by the platform_block_arena, so we aren't actually giving memory back to the OS (the memory gets reused the next time we build our grid).

Using the grid setup, I was able to get to 5-6 thousand entities at real time frame rates.

SIMD Small Primer


The idea behind SIMD is that we often run into algorithms that need to do the same operations on different pieces of data (hence, Single Instruction Multiple Data). For example, in image processing we often want to apply the same algorithm to each pixel, like when we apply gaussian blurs or other filters. The CPU exposes instructions that perform the same operation on multiple pieces of data at once. For example, we can add 4 float values in the same time it takes to add 1 float value (a 4x speedup). The way SIMD is exposed on x64 CPUs is via the SSE and AVX instruction sets (you can read more about them at https://software.intel.com/sites/landingpage/IntrinsicsGuide/). For this post, I'm limiting myself to pre-AVX SSE extensions for simplicity. 

The SSE instruction set gives us 2 new data types that we will be using, __m128 and __m128i. Both are 128-bit vectors, but one is a float register while the other is an integer register. We can pack 4 32-bit float values into one __m128 register, or 4 32-bit integer values into one __m128i register. After filling in values, we can operate on these vectors using regular operations like addition, subtraction, and multiplication, but each operation will do 4 of those at a time. So if we fill one register with values 0, 1, 2, 3 and the other with 4, 5, 6, 7, then a call to _mm_add_epi32 (add 32-bit integers) will give us a vector with values 4, 6, 8, 10 (each lane of the vector operates independently of the others). 
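That example looks like the following with the raw intrinsics (the helper function is just for illustration):

```cpp
#include <emmintrin.h> // NOTE: SSE2 intrinsics
#include <cstdint>

// NOTE: Adds {0, 1, 2, 3} and {4, 5, 6, 7} lane by lane to produce {4, 6, 8, 10}
inline void AddFourInts(int32_t* Out)
{
    // NOTE: _mm_set_epi32 takes its arguments from high lane to low lane
    __m128i A = _mm_set_epi32(3, 2, 1, 0);
    __m128i B = _mm_set_epi32(7, 6, 5, 4);
    __m128i Sum = _mm_add_epi32(A, B);
    _mm_storeu_si128((__m128i*)Out, Sum);
}
```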

When loading values into registers, we generally have 3 options:

1) We can call _mm_set_ps/_mm_set_epi32, which lets us specify the 4 values directly
2) We can call _mm_load_ps/_mm_load_si128, which does an aligned load from memory. The requirements here are that the passed pointer is aligned to 16 bytes and that the 4 values are contiguous in memory.
3) We can call _mm_loadu_ps/_mm_loadu_si128, which behave like the above but without the alignment requirement. On current hardware, the performance of the two calls is the same except in some edge cases (when a load crosses a cache-line boundary), but we can ignore that for this post since it won't come up.

We have similar requirements for storing (_mm_store_ps/_mm_storeu_ps work like their load counterparts). We want to use the load/store calls provided by the instruction set since they are the quickest way to move 128-bit registers to and from memory. The requirement this imposes on us is that we have to keep our data as contiguous in memory as possible.
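A small hedged example of the unaligned load/store path (the function name and scaling operation are my own, just to show the calls):

```cpp
#include <xmmintrin.h> // NOTE: SSE intrinsics

// NOTE: Round-trips 4 contiguous floats through a register with unaligned load/store
inline void Scale4(const float* In, float* Out, float Scale)
{
    __m128 V = _mm_loadu_ps(In);           // NOTE: No 16-byte alignment required
    V = _mm_mul_ps(V, _mm_set1_ps(Scale)); // NOTE: Broadcast Scale to all 4 lanes
    _mm_storeu_ps(Out, V);
}
```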

SIMDifying Grids

We already have a working grid and we understand how it's stored in memory, so now we have to think about how to apply SIMD to optimize it. The main bottleneck is the entity updating, so we will focus on that step (the others take a minimal amount of computation in comparison). SIMD likes working on multiple entities at a time (for each loop, compute 4 entities at a time instead of 1), so I like to use the following structs in my code:

union v1u_x4
{
    struct
    {
        __m128i x;
    };
        
    u32 e[4];
};

union v1i_x4
{
    struct
    {
        __m128i x;
    };
    
    i32 e[4];
};
 
union v1_x4
{
    struct
    {
        __m128 x;
    };
    
    f32 e[4];
};

union v2_x4
{
    struct
    {
        v1_x4 x, y;
    };

    v1_x4 e[2];
};

Using the above lets me more easily write code that works on multiple entities at a time, instead of trying to SIMDify the update of one entity. This also happens to mimic more closely what GPU execution looks like. If you've written shader code before, you'll probably know that each time you write a float add or vec add, it's actually happening for 32/64 elements at a time, but the programming model lets you write the code as if it's only happening for one. Below is an example of how converting scalar code to SIMD code using my vector types ends up looking:

v2 NewBirdVelocity;
v2 NewBirdPosition;
v2 AvgFlockPos;
v2 AvgFlockDir;

// NOTE: Fly towards center
NewBirdVelocity += DemoState->MoveToFlockWeight * (AvgFlockPos - NewBirdPosition);
// NOTE: Avoid Others
NewBirdVelocity += DemoState->AvoidBirdWeight * AvgFlockAvoidance;
// NOTE: Align Velocities
NewBirdVelocity += DemoState->AlignFlockWeight * (AvgFlockDir - NewBirdVelocity);
                        
// NOTE: Clamp Velocity
f32 BirdSpeed = Clamp(Length(NewBirdVelocity), DemoState->MinSpeed, DemoState->MaxSpeed);
NewBirdVelocity = BirdSpeed * Normalize(NewBirdVelocity);
// NOTE: Avoid Terrain
NewBirdVelocity += DemoState->AvoidTerrainWeight * AvoidWallDir;
// NOTE: Update Position
NewBirdPosition += NewBirdVelocity * FrameTime;

=> 

v2_x4 NewBirdVelocity;
v2_x4 NewBirdPosition;
v2_x4 AvgFlockPos;
v2_x4 AvgFlockDir;

// NOTE: Fly towards center
NewBirdVelocity += DemoState->MoveToFlockWeight * (AvgFlockPos - NewBirdPosition);
// NOTE: Avoid Others
NewBirdVelocity += DemoState->AvoidBirdWeight * AverageData.AvgFlockAvoidance;
// NOTE: Align Velocities
NewBirdVelocity += DemoState->AlignFlockWeight * (AvgFlockDir - NewBirdVelocity);
                        
// NOTE: Clamp Velocity
v1_x4 BirdSpeed = Clamp(Length(NewBirdVelocity), V1X4(DemoState->MinSpeed), V1X4(DemoState->MaxSpeed));
NewBirdVelocity = BirdSpeed * Normalize(NewBirdVelocity);
// NOTE: Avoid Terrain
NewBirdVelocity += DemoState->AvoidTerrainWeight * AvoidWallDir;
// NOTE: Update Position
NewBirdPosition += NewBirdVelocity * FrameTime;

Since I'm designing the code to operate on 4 entities at a time, it's usually enough for computation code to just have its types renamed and you get parallelization almost for free (you still have to figure out how to load the correct values into the SIMD registers and store them afterwards, which we will go over below). The above SIMD code does the same thing as the scalar code, but computes it for 4 entities at a time.
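The type-rename trick works because the wrapper types overload the arithmetic operators to forward to intrinsics. A minimal hedged sketch (using a simplified union without the anonymous struct wrapper):

```cpp
#include <xmmintrin.h>

// NOTE: Simplified version of v1_x4 for illustration
union v1_x4_sketch
{
    __m128 x;
    float e[4];
};

// NOTE: Overloaded operators are what let scalar-looking code compile to SIMD
inline v1_x4_sketch operator+(v1_x4_sketch A, v1_x4_sketch B)
{
    v1_x4_sketch Result;
    Result.x = _mm_add_ps(A.x, B.x);
    return Result;
}

inline v1_x4_sketch operator*(v1_x4_sketch A, v1_x4_sketch B)
{
    v1_x4_sketch Result;
    Result.x = _mm_mul_ps(A.x, B.x);
    return Result;
}
```

With a full set of these overloads (plus the v2_x4 versions that apply them per component), the rule-application code above compiles without further changes.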

The part where CPU SIMD gets more complicated over GPU is in memory operations. GPUs support arbitrary gather/scatter while CPUs do not (at least for SSE). On CPUs, the only things we have in SSE are aligned and unaligned loads/stores. We can however mimic gather/scatter operations by doing the memory operations in scalar code, but it is much slower than the SSE load/stores, so we want to avoid them as much as possible.

inline v1_x4 V1X4(f32 X, f32 Y, f32 Z, f32 W)
{
    v1_x4 Result = {};
    Result.x = _mm_set_ps(W, Z, Y, X);
    return Result;
}

inline v1_x4 V1X4Gather(f32* X, f32* Y, f32* Z, f32* W, v1u_x4 Mask)
{
    v1_x4 Result = V1X4(Mask.e[0] ? *X : 0, Mask.e[1] ? *Y : 0, Mask.e[2] ? *Z : 0, Mask.e[3] ? *W : 0);
    return Result;
}

With grids, we are accessing some random list of entities stored in each cell that can change frame to frame, so we would expect it to get more and more incoherent over time as entities randomly move around. If we just take the code as is and SIMDify it, we basically have to use gather/scatter to load 4 entities at a time, which reverts to scalar code and is much less efficient.

The solution I came up with was a bit of a lucky accident. I noticed that I'm already double buffering all my entities, and the grid is recreated every frame, so instead of writing the entity at index 0 of the read buffer to index 0 of the write buffer, I write it to a memory location that puts it near the other entities in its cell. Below is some scalar pseudocode to illustrate the idea:

// NOTE: Loop over all grids and all entities in the grids
u32 GlobalIndexId = 0;
for (u32 GridY = 0; GridY < Grid->NumCellsY; ++GridY)
{
    for (u32 GridX = 0; GridX < Grid->NumCellsX; ++GridX)
    {
        grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

        u32 BirdBlockIndexId = 0;
        for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
        {
            u32* BlockIndices = BlockGetData(CurrBlock, u32);
            u32 NumIndicesInBlock = Min(CurrCell->NumIndices - BirdBlockIndexId, Grid->MaxNumIndicesPerBlock);
            for (u32 IndexId = 0; IndexId < NumIndicesInBlock; ++IndexId)
            {
                u32 CurrBirdId = BlockIndices[IndexId];

                // NOTE: Load bird data
                v2 NewBirdPosition = PrevBirdArray[CurrBirdId].Pos;
                v2 NewBirdVelocity = PrevBirdArray[CurrBirdId].Vel;

                // NOTE: Get average data
                ...
                    
                // NOTE: Apply rules
                ...
                    
                NewBirdPosition += NewBirdVelocity * FrameTime;
                
                CurrBirdArray[GlobalIndexId].Vel = NewBirdVelocity;
                CurrBirdArray[GlobalIndexId].Pos = NewBirdPosition;
                GlobalIndexId += 1;
            }

            BirdBlockIndexId += NumIndicesInBlock;
        }
    }
}

The first thing you might notice is that I'm no longer iterating my array of entities element by element, but instead grid cell by grid cell. This also means I don't have to store a temporary array of grid cell ids for each entity. The other addition is that I maintain a global write index as I write my entities from the read array into the write array. This means I write out the entities in traversal order, which happens to be grid cell by grid cell, so for the most part, the entities in each grid cell will sit in contiguous blocks of memory, which is exactly what SIMD likes! When I SIMDify this code, I can check if the 4 indices I load are sequential; if they are, I do a regular load, otherwise I do a gather for the edge cases, which are not very common. This ends up being the following:

struct bird_array
{
    f32* PosX;
    f32* PosY;
    f32* VelX;
    f32* VelY;
};

inline u32 MoveMask(v1u_x4 V)
{
    // NOTE: Outputs 1 bit per element, set if that element's most significant bit is set
    u32 Result = u32(_mm_movemask_ps(_mm_castsi128_ps(V.x)));
    return Result;
}

...

// NOTE: Update birds
u32 GlobalIndexId = 0;
for (u32 GridY = 0; GridY < Grid->NumCellsY; ++GridY)
{
    for (u32 GridX = 0; GridX < Grid->NumCellsX; ++GridX)
    {
        grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

        u32 BirdBlockIndexId = 0;
        for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
        {
            u32* BlockIndices = BlockGetData(CurrBlock, u32);
            u32 NumIndicesInBlock = Min(CurrCell->NumIndices - BirdBlockIndexId, Grid->MaxNumIndicesPerBlock);
            for (u32 IndexId = 0; IndexId < NumIndicesInBlock; IndexId += 4)
            {
                v1u_x4 CurrBirdId = V1UX4LoadUnAligned(BlockIndices + IndexId);
                u32 FirstBirdId = CurrBirdId.e[0];

                // NOTE: Check if we can do a load or if we have to gather
                v1u_x4 BirdValidMask = (V1UX4(IndexId) + V1UX4(0, 1, 2, 3)) < V1UX4(NumIndicesInBlock);
                v1u_x4 AlignedMask = {};
                {
                    v1u_x4 FirstBirdVec = V1UX4(FirstBirdId);
                    AlignedMask = (CurrBirdId - FirstBirdVec) == V1UX4(0, 1, 2, 3);
                }

                // NOTE: Load bird data
                v2_x4 NewBirdPosition = {};
                v2_x4 NewBirdVelocity = {};
                {
                    if (MoveMask(BirdValidMask) == 0xF && MoveMask(AlignedMask) == 0xF)
                    {
                        // NOTE: All 4 indices are valid and sequential, so we can do regular unaligned loads
                        NewBirdPosition = V2X4LoadUnAligned(PrevBirdArray.PosX + FirstBirdId, PrevBirdArray.PosY + FirstBirdId);
                        NewBirdVelocity = V2X4LoadUnAligned(PrevBirdArray.VelX + FirstBirdId, PrevBirdArray.VelY + FirstBirdId);
                    }
                    else
                    {
                        // NOTE: We have to do a masked gather here
                        NewBirdPosition = V2X4Gather(PrevBirdArray.PosX, PrevBirdArray.PosY, CurrBirdId, BirdValidMask);
                        NewBirdVelocity = V2X4Gather(PrevBirdArray.VelX, PrevBirdArray.VelY, CurrBirdId, BirdValidMask);
                    }
                }
                                                                                
                bird_average_data AverageData = GridGetAverageData(Grid, PrevBirdArray, NewBirdPosition,
                                                                   CurrBirdId, BirdValidMask);
                                        
                // NOTE: Apply rules
                ...
                    
                NewBirdPosition += NewBirdVelocity * FrameTime;
                                            
                // NOTE: Write into next bird array (we do unaligned stores since sometimes we store
                // less than 4 elements and then the next write won't be aligned)
                StoreUnAligned(NewBirdVelocity.x, CurrBirdArray.VelX + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdVelocity.y, CurrBirdArray.VelY + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdPosition.x, CurrBirdArray.PosX + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdPosition.y, CurrBirdArray.PosY + GlobalIndexId + IndexId);
            }

            BirdBlockIndexId += NumIndicesInBlock;
            GlobalIndexId += NumIndicesInBlock;
        }
    }
}

There are a couple of points to make about the above code: 

1) I rewrote the entity storage code to be SoA (struct of arrays) so that we can easily load our entity attributes using loads/stores and not have to do any shuffling. The arrays are also padded to the size of SSE vector registers. This way, when I have a v2_x4, it's loading 4 sequential x values into one __m128 register, and 4 sequential y values into another __m128 register (everything is cleanly separated).

2) We can run into cases where a given cell has, for example, 3 entities, so when we try to load 4, one will be invalid and we need to track and ignore its results. We build a per-lane mask for that and convert it to a scalar bitmask via movemask. 

3) We also use a movemask to figure out if we have sequential indices and can use an unaligned load, or if we have to take the slow path and do a gather.

4) When we store our entities, we always store all 4 of them, even if only 1, 2, or 3 are valid. The reason this works out and we don't later read garbage data is that we advance GlobalIndexId by only the # of valid entities, so the next loop iteration will overwrite the garbage ones we stored.

The other big change with SIMD is that all if statements that were used for computation have to be replaced with masks that cancel out unwanted values. 

v2 DistanceVec = NearbyBird->Position - NewBirdPosition;
f32 DistanceSq = LengthSquared(DistanceVec);

if (DistanceSq < DemoState->BirdRadiusSq)
{
    NumBirdsInRadius += 1;
                                
    // NOTE: Velocity Matching
    AvgFlockDir += NearbyBird->Velocity;

    // NOTE: Bird Flocking
    AvgFlockPos += NearbyBird->Position;
}
if (DistanceSq < DemoState->AvoidRadiusSq)
{
    // NOTE: Avoidance
    AvgFlockAvoidance += -DistanceVec;
}

=>

v2 DistanceVec = NearbyBird->Position - NewBirdPosition;
f32 DistanceSq = LengthSquared(DistanceVec);

bool BirdRadiusMask = DistanceSq < DemoState->BirdRadiusSq;
NumBirdsInRadius += BirdRadiusMask;
                                
// NOTE: Velocity Matching
AvgFlockDir += BirdRadiusMask * NearbyBird->Velocity;

// NOTE: Bird Flocking
AvgFlockPos += BirdRadiusMask * NearbyBird->Position;

bool AvoidRadiusMask = DistanceSq < DemoState->AvoidRadiusSq;
AvgFlockAvoidance += AvoidRadiusMask * -DistanceVec;

This is also something that GPU shader code does for you for free, without you having to manually add it to your code. If the condition is false, the bool mask will be 0, so all the computation it's multiplied by will be nullified; otherwise it will be equal to 1 and the result taken as is. 
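In actual SSE, the "multiply by a bool" becomes a compare that produces an all-ones or all-zeros bit pattern per lane, followed by a bitwise AND. A hedged sketch of the radius test (the function and its layout are my own, just to show the pattern):

```cpp
#include <xmmintrin.h>

// NOTE: Accumulates Value into Accum only in lanes where DistSq < RadiusSq
inline void MaskedAccumulate(const float* DistSq, const float* Value,
                             float RadiusSq, float* Accum)
{
    __m128 Dist = _mm_loadu_ps(DistSq);
    __m128 Val = _mm_loadu_ps(Value);
    // NOTE: Compare yields 0xFFFFFFFF or 0 per lane; AND keeps only passing lanes
    __m128 Mask = _mm_cmplt_ps(Dist, _mm_set1_ps(RadiusSq));
    __m128 Masked = _mm_and_ps(Mask, Val);
    _mm_storeu_ps(Accum, _mm_add_ps(_mm_loadu_ps(Accum), Masked));
}
```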

For the GridGetAverageData, it basically works the same as before, stepping through neighboring cells one entity at a time but we check that 1 entity against all 4 of the entities we previously loaded in each instruction. 

With SIMD, I was able to run between 10-20 thousand entities at real time frame rates. It depends heavily on how you set your rules so that you don't get too many entities in a single grid cell; sometimes 40k entities ran at real time frame rates. 

Final Thoughts

I was pleasantly surprised by how nice and clean this solution is at taking advantage of a grid structure and SIMD optimizations together. There is still probably a good amount of room for optimizing this code as it currently stands: 

1) I think because I double buffered, I can actually remove the gather in cases where an entity isn't referenced in more than one grid cell, since we are guaranteed coherent reads from last frame. Instead of a gather, it would have to behave as a masked load, so just ignore the extra values that get read and make sure our arrays are padded to avoid reading past allocations.

2) When I get the block average data, I compare 4 entities against 1, but it might be better to compare 4 against 4. It would take a bit more effort but this might end up being faster to do.

3) The code can be multithreaded. Each thread can work on a grid cell or maybe a block of a grid cell. You would have to be more careful with how threads update and store entities, but as is, it's very possible to do.

4) Right now I walk the grid left to right, bottom to top, but there might be a swizzled walk order that's better, since nearby entities might already be in cache due to GridGetAverageData reading nearby grid cells.

5) For the specific problem of boids, I could treat it more like a fluid sim and write velocity/position values into textures, then use bilinear filtering to get average data instead of walking nearby entities for each entity. This would change the sim output, but it would still approximate it well. I didn't do this because I wanted the solution to also apply to other problems like pathfinding and collision, but it would probably be much faster if we were just concerned with boids.

6) Since we already store entities in a grid, we can try to store entity positions as 8/16-bit fixed-point values relative to their grid cell. You would have to make sure the precision is good enough, but it can help reduce memory traffic if that is a bottleneck.

etc..
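For point 6, a hedged sketch of what the 8-bit fixed-point encoding could look like (the names and the rounding scheme are my own assumptions):

```cpp
#include <cstdint>
#include <cmath>

typedef uint8_t u8;
typedef float f32;

// NOTE: Hypothetical: store a coordinate as an 8-bit fixed-point offset
// relative to its grid cell's min corner
inline u8 EncodeCellLocal(f32 X, f32 CellMin, f32 CellSize)
{
    f32 Normalized = (X - CellMin) / CellSize; // NOTE: In [0, 1] inside the cell
    return (u8)(Normalized * 255.0f + 0.5f);   // NOTE: Round to nearest step
}

inline f32 DecodeCellLocal(u8 Encoded, f32 CellMin, f32 CellSize)
{
    return CellMin + (f32(Encoded) / 255.0f) * CellSize;
}
```

The worst-case error is half a step (CellSize / 510), which is what you would have to check against your simulation's precision needs.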

I think you can extend this to other acceleration structures or other game problems, but there is an important gotcha that would need to be taken care of. The main idea behind my optimization is to write entities out in a different order than they were previously in. This would break any system that stores references to these entities, since their position in the array is not stable. There are a few ways around this that I don't think would nullify the approach:

1) Generally we already have a handle system in place for storing references to entities. After we are done moving entities around, we can have each entity remember which slot it occupies in the handle system and update that slot's index to the entity's new index in the array. This would require moving all the other entity data to new indices, which might be inefficient.

2) Do a similar approach as above but have the handle slots store 2 indexes, one for game logic data and one for sim data. This way we still update the sim data index but we avoid having to copy around all the other entity game logic data.

This kind of optimization also appears in different contexts. I recently started writing a 2d graph visualizer, and you run into similar problems there: each node is affected in some way by nearby nodes, and you want an acceleration structure to speed that up. In that case I'm writing it on the GPU, so maybe I'll write a future blog post about that. 

For now, you can find some code for this demo here: https://github.com/Ihaa21/Boids. The demo might have some todos and things that I haven't gotten around to, but you can use it as a reference for how I went about optimizing this algorithm.

Wednesday, 13 June 2018

Sparse Voxel Octree Renderer (Fluffy)

Fluffy is a realtime CPU sparse voxel octree renderer written in C++ using AVX instructions. The renderer supports real time rendering of arbitrarily oriented octrees with per-pixel detail at 512 by 512 resolution using a single core. I used Intel VTune Amplifier to profile and improve the code for this project. The code for this project can be found at https://github.com/Ihaa21/Fluffy. Below I detail the process of optimizing this project from 800ms per frame to 20ms per frame.

Rendering Algorithm:

Fluffy is an attempt at building a rendering algorithm where, instead of the run time being decided by model complexity, it is decided by the size of the screen. This is done by hierarchically traversing models stored as octrees, and occluding as many nodes and their children as possible.

The renderer begins by projecting the root node to the screen and checking if the node is occluded. If it isn't, we check if the node is a single pixel in size. If it is, we render it; otherwise we split it into its 8 children and recursively apply the above algorithm in front-to-back order. This way, we only traverse nodes which are visible, and if a node isn't visible, it and its children are all rejected, preventing us from doing needless checks. The algorithm's run time is mostly dependent on the size of the screen instead of the geometry, since it only traverses nodes which are visible and renders nodes as soon as they are a single pixel in size. The pseudocode can be found below:

void RenderOctree(v3 NodeCenter, f32 NodeRadius, octree* Node)
{
    i32 MinX, MaxX, MinY, MaxY;
    ProjectNodesCorners(NodeCenter, NodeRadius, &MinX, &MaxX, &MinY, &MaxY);

    MinX = ClipMinX(MinX);
    MaxX = ClipMaxX(MaxX);
    MinY = ClipMinY(MinY);
    MaxY = ClipMaxY(MaxY);

    if (IsNodeOccluded(MinX, MaxX, MinY, MaxY))
    {
        return;
    }

    // NOTE: Nodes that cover a single pixel get rendered directly
    if (IsPixelSized(MinX, MaxX, MinY, MaxY))
    {
        RenderNode(MinX, MinY, Node);
        return;
    }

    f32 ChildRadius = NodeRadius / 2.0f;
    v3 ChildrenCenters[8];
    GetChildrenCenters(NodeCenter, NodeRadius, ChildrenCenters);

    // NOTE: Sort children centers such that we traverse them front to back
    f32 Keys[8] =
    {
        Length(ChildrenCenters[0]),
        Length(ChildrenCenters[1]),
        Length(ChildrenCenters[2]),
        Length(ChildrenCenters[3]),
        Length(ChildrenCenters[4]),
        Length(ChildrenCenters[5]),
        Length(ChildrenCenters[6]),
        Length(ChildrenCenters[7]),
    };
    u32 Order[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    SortLeastToGreatest(Keys, Order);

    for (u32 ChildId = 0; ChildId < 8; ++ChildId)
    {
        u32 SortedId = Order[ChildId];

        RenderOctree(ChildrenCenters[SortedId], ChildRadius, Node->Children[SortedId]);
    }
}

I visualized how many recursions the algorithm above performs for every pixel; the results can be seen below (different colors represent different numbers of recursions):

The picture on the left uses float (unrounded) min/max values to decide if a node is small enough to render. The picture on the right uses the rounded integer min/max values instead. We can see that many nodes in the right picture need fewer recursions than in the picture on the left.
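Concretely, the rounded integer bounds make the single pixel test succeed sooner, because snapping the float extents to the pixel grid often collapses a node to a 1x1 footprint. A minimal sketch of that check (the helper names here are my own, not Fluffy's):

```cpp
#include <cmath>

struct pixel_bounds { int MinX, MaxX, MinY, MaxY; };

pixel_bounds RoundBounds(float MinX, float MaxX, float MinY, float MaxY)
{
    // Snap the float extents to the pixel grid they actually cover
    pixel_bounds Result;
    Result.MinX = (int)floorf(MinX);
    Result.MaxX = (int)ceilf(MaxX);
    Result.MinY = (int)floorf(MinY);
    Result.MaxY = (int)ceilf(MaxY);
    return Result;
}

bool IsPixelSized(pixel_bounds B)
{
    // A node that covers at most one pixel in each axis can be drawn
    // directly instead of being recursed into
    return (B.MaxX - B.MinX) <= 1 && (B.MaxY - B.MinY) <= 1;
}
```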

Algorithm Optimizations

Quicker projection calculation

In the above renderer pseudo code, we treat each octree node as a cube and project its 8 corners onto the screen to calculate the min/max bounds the cube takes up on screen. Doing so is a very slow operation that has to be executed millions of times per frame. To speed this up, I pretend every oriented octree node is actually the smallest screen facing cube that encapsulates that node. I then calculate which corners of this cube correspond to the min/max values of the node's projected extents.
In certain orientations, the corners which correspond to the min/max values change, as can be seen in the images below:
To speed up cube projection, I came up with 2 algorithms. Algorithm 1 projects the min/max corners of the front and back faces, and then calculates the min/max values from the 4 projected points. This halves the number of corners to project. Algorithm 2 projects only 2 x,y value pairs but adds extra if statements to figure out which corners' x,y values correspond to the projected cube's min/max values. In the above image, for example, the center cube has 3 points which correspond to the cube's min/max values on screen. But the top left corner only needs its y value projected to calculate the max y, and the top right corner only needs its x value projected to calculate the max x. The algorithm identifies such cases depending on which faces of the cube are visible on screen. Reviewing the benchmarks below, we can see that algorithm 2 turned out to be faster.

              Mean        Average      StDev        Min         Max
Scalar        211.883003  212.7565042  4.124100585  205.675797  230.36496
Algorithm 1   75.268517   75.54082873  1.312184091  73.541237   84.518555
Algorithm 2   74.270752   74.43177688  0.984393028  72.268059   85.476723
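Algorithm 1 can be sketched roughly as follows, assuming a simple pinhole projection (the `Project` convention and all names here are my assumptions, not Fluffy's actual code): project only the min and max corners of the front and back faces, then take the min/max of the 4 results.

```cpp
#include <algorithm>

struct v3 { float x, y, z; };
struct v2 { float x, y; };

// Hypothetical pinhole projection: scale x,y by focal length over depth
v2 Project(v3 P, float Focal)
{
    return v2{ Focal * P.x / P.z, Focal * P.y / P.z };
}

// Algorithm 1 sketch: project 4 corners instead of all 8
void ProjectCubeBounds(v3 Center, float Radius, float Focal,
                       float* MinX, float* MaxX, float* MinY, float* MaxY)
{
    v3 Corners[4] =
    {
        { Center.x - Radius, Center.y - Radius, Center.z - Radius }, // front min
        { Center.x + Radius, Center.y + Radius, Center.z - Radius }, // front max
        { Center.x - Radius, Center.y - Radius, Center.z + Radius }, // back min
        { Center.x + Radius, Center.y + Radius, Center.z + Radius }, // back max
    };

    *MinX = *MinY = 1e30f;
    *MaxX = *MaxY = -1e30f;
    for (int I = 0; I < 4; ++I)
    {
        v2 P = Project(Corners[I], Focal);
        *MinX = std::min(*MinX, P.x); *MaxX = std::max(*MaxX, P.x);
        *MinY = std::min(*MinY, P.y); *MaxY = std::max(*MaxY, P.y);
    }
}
```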

Minimizing Sorting:


Given that the program traverses millions of octree nodes per frame, it can't feasibly sort every node's children into front to back order. At the point I decided to examine this issue, sorting was taking roughly 25% of the program's run time. If an octree node is rotated, the program finds the smallest encapsulating screen facing box and renders that box instead. When drawing these nodes, I realized that most of the time the sorted order of the parent is the same as the child's, but I also identified cases where that isn't true, as shown in the images below:

I simplified the above example to a 2d case instead of 3d, but the same principles apply. The parent node on the left has 4 children: A, B, C, D. Nodes A, B and nodes C, D are each the same distance from the camera, so to traverse front to back we can go A->B->C->D, A->B->D->C, or any other permutation where we swap A with B or C with D. If we then examine the children of B, we see that node 0 is strictly closer to the camera than node 1, and node 2 is strictly closer than node 3, so there is only one possible order. This is an example of an order changing from the parent to the children.

I later realized that the traversal order only changes from parent to children when the visible faces of the parent differ from the visible faces of the children. This change of visible faces only happens when the parent node intersects the x or y axis while the child node does not, as can be seen in the below picture:

Therefore, I only have to sort a node's children when the node intersects the x or y axis; otherwise I can propagate the parent's traversal order onto the children. When profiling, almost 100% of our nodes don't fall into this category, and we no longer spend even 1% of the program's runtime sorting nodes.
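A sketch of that decision, under the assumption that view space is centered on the camera (`NeedsSort`, `ChildTraversalOrder`, and the surrounding names are illustrative, not Fluffy's actual code):

```cpp
#include <cmath>

// A node's front to back child order can only differ from its parent's
// when different faces become visible, which only happens when the node
// straddles the camera's x or y axis
bool NeedsSort(float CenterX, float CenterY, float Radius)
{
    bool CrossesX = fabsf(CenterX) < Radius;
    bool CrossesY = fabsf(CenterY) < Radius;
    return CrossesX || CrossesY;
}

void ChildTraversalOrder(const unsigned int ParentOrder[8],
                         const float ChildKeys[8],
                         float CenterX, float CenterY, float Radius,
                         unsigned int OutOrder[8])
{
    // Default: propagate the parent's traversal order unchanged
    for (int I = 0; I < 8; ++I) { OutOrder[I] = ParentOrder[I]; }

    if (NeedsSort(CenterX, CenterY, Radius))
    {
        // Rare case: insertion sort the 8 child indices by distance key
        for (int I = 1; I < 8; ++I)
        {
            unsigned int Id = OutOrder[I];
            int J = I - 1;
            while (J >= 0 && ChildKeys[OutOrder[J]] > ChildKeys[Id])
            {
                OutOrder[J + 1] = OutOrder[J];
                --J;
            }
            OutOrder[J + 1] = Id;
        }
    }
}
```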

Orthographic Mode:

The last optimization is to render as many nodes as possible using orthographic projection. When performing perspective projection, we have to store the world positions of nodes and perform divisions to project them to the screen. This means we use more memory and have to wait on long latency instructions which can stall the CPU pipeline. So Fluffy checks if a node's back face projects to the same pixels as its front face, which means the node is so small that the effects of perspective projection are no longer visible. This is shown in the below images:

Fluffy checks if nodes are ready for orthographic projection by checking if the top right corners of the front and back faces of the cube project to the same pixel. If they do, it switches to orthographic mode and subdivides the node in screen space by finding the midpoints of every edge, as shown in the below image:
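The orthographic readiness test can be sketched like this (the projection convention and names are my assumptions, not Fluffy's exact code):

```cpp
struct v3 { float x, y, z; };

// A node is ready for orthographic subdivision when perspective no longer
// moves its corner by even one pixel between the front and back faces
bool ReadyForOrtho(v3 Center, float Radius, float Focal)
{
    float FrontZ  = Center.z - Radius;
    float BackZ   = Center.z + Radius;
    float CornerX = Center.x + Radius;
    float CornerY = Center.y + Radius;

    // Pixel coordinates of the top right corner on both faces
    int FrontPx = (int)(Focal * CornerX / FrontZ);
    int FrontPy = (int)(Focal * CornerY / FrontZ);
    int BackPx  = (int)(Focal * CornerX / BackZ);
    int BackPy  = (int)(Focal * CornerY / BackZ);

    return FrontPx == BackPx && FrontPy == BackPy;
}
```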

Below are the performance results of applying this method.


        Mean        Average      StDev        Min         Max
Scalar  175.455917  174.8891162  19.78129413  127.600182  217.117813
Ortho1  148.587189  149.6050052  16.38707136  107.362137  182.3797

SIMD Optimizations

Processing 8 nodes at a time

Originally, I tried to apply SIMD to the algorithm outlined above, which meant Fluffy was processing one node at a time. This resulted in lots of horizontal operations, like finding the min/max value in an AVX vector. I decided to instead process 8 nodes at a time, doing as much as possible in SIMD and then switching to scalar for things like checking node occlusion. This made the code more parallel and easier to implement in SIMD.

Screen as 8x8 blocks

The algorithm touches many pixels to render the octree to the screen. I knew that every row of pixels touched by a node would result in a cache miss once we jumped to the next row. I tried grouping pixels into 8x8 blocks, with each pixel stored as a 1 bit coverage value instead of a 32 bit depth value, so that I could lower the number of cache misses when accessing the coverage buffer. This however incurs a performance hit to generate masks whenever we loop per pixel, since sometimes our projected cube only partially covers an 8x8 block of pixels. The performance results can be seen below:
                  Mean       Average      StDev        Min        Max
Scalar Occlusion  50.595562  50.44175719  9.017119674  23.433887  68.161186
8x8 Occlusion     61.357552  61.02864724  10.92668141  27.881401  83.063866

Surprisingly, this method made performance worse because of the extra instructions it took to generate the pixel masks. In the end, I decided to use an 8 bit coverage mask with a scalar loop to check for node occlusion, avoiding any extra mask generation while still shrinking my coverage buffer by a factor of 4 (from 32 bit depth).
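A scalar occlusion check against a 1 bit per pixel coverage buffer might look like the following sketch (the byte layout, packing 8 horizontal pixels per byte, is my assumption):

```cpp
#include <cstdint>

// A node is occluded when every pixel in its screen rect is already covered
bool IsRectOccluded(const uint8_t* Coverage, int ScreenWidth,
                    int MinX, int MaxX, int MinY, int MaxY)
{
    int BytesPerRow = ScreenWidth / 8;
    for (int Y = MinY; Y <= MaxY; ++Y)
    {
        for (int X = MinX; X <= MaxX; ++X)
        {
            uint8_t Byte = Coverage[Y * BytesPerRow + (X / 8)];
            uint8_t Bit  = (uint8_t)(1 << (X % 8));
            if ((Byte & Bit) == 0)
            {
                // Found an uncovered pixel, so the node is still visible
                return false;
            }
        }
    }
    return true;
}
```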

Tuesday, 8 May 2018

Programming Portfolio

Software Ray Tracer: 
Link: https://ihorszlachtycz.blogspot.ca/2018/05/software-ray-tracer-pixel-scratcher.html
Source Link: https://github.com/Ihaa21/SoftwareRayTracer

Familiarized myself with trends in current game lighting and global illumination by coding a C++ 3D software ray tracer with support for indirect light bounces and transparent materials from scratch.

OpenGL Deferred Renderer:
Link: https://ihorszlachtycz.blogspot.ca/2018/05/deferred-renderer-and-model-loader-coco.html
Source Link: https://github.com/Ihaa21/Coco-Deferred-Renderer

Built a program using C++ and OpenGL which rendered 3D geometry from .obj files with support for texture mapping and deferred shading in GLSL shaders. This project was driven by my interest in 3d rendering and shader programming.

C Compiler:

Link: https://ihorszlachtycz.blogspot.ca/2018/05/c-compiler-nutella.html
Source Link: https://github.com/Ihaa21/nutella_compiler

Gained understanding about how code gets translated and read by a CPU by building a compiler that translates a subset of C to x86 assembly. The compiler supports runtime execution of code and was built from scratch in C++.

Independent Video Game:


Link: https://ihorszlachtycz.blogspot.ca/2018/05/independent-video-game-rosemary.html

Worked independently and full time for 9 months on a multi platform video game for Android and iOS, written completely in C++ with memory management, profiling tools, and an OpenGL ES renderer. The project is entirely self-driven and built without any external libraries.

Software Rasterizer:
Link: https://ihorszlachtycz.blogspot.ca/2018/05/software-triangle-rasterizer-kiwi.html
Source Link: https://github.com/Ihaa21/Kiwi-Software-Rasterizer

Implemented a 3D software renderer in C++ that renders transformed textured triangles with depth buffering, and a top left filling rule. By doing so, I gained a greater understanding about the perspective transform, and the actions a GPU must perform in order to produce images on our screens.

Realtime Software Octree Renderer:
Link: http://ihorszlachtycz.blogspot.com/2018/06/sparse-voxel-octree-renderer-fluffy.html
Source Link: https://github.com/Ihaa21/Fluffy

Implemented a realtime octree software renderer in C++ which can render highly detailed objects on a single core. The renderer implements a level of detail algorithm to only render per pixel detail, making rendering of complex models dependent on screen size instead of model complexity.

C++ Neural Network:

Link: https://ihorszlachtycz.blogspot.ca/2018/05/neural-network-cognition.html
Source Link: https://github.com/Ihaa21/Cognition-NeuralNetwork-

Coded a neural network that recognizes and reconstructs handwritten digits in images from scratch in C++. This project was inspired by my current research in the field of artificial intelligence (AI) and image recognition.

C Compiler (Nutella)

Nutella is a compiler that transforms a subset of C to x86 assembly. It supports integers, floats, pointers and structs as well as if statements and while statements. It has support for (+ - * /) math operations, (==, !=, >=, <=, >, <) comparison operations as well as (*, &) pointer operations. The compiler also supports compile time code execution with support to run math operations at compile time. The code for this project can be found at https://github.com/Ihaa21/nutella_compiler.

This project started off as a school project where we were supposed to work in groups to build a compiler. I was fascinated by the idea of building a compiler, so I decided to build one on my own without external tools to help parse the code. Below I detail how the compiler converts a subset of C code to x86 assembly:

Parsing:

The compiler starts by parsing the code into tokens; while parsing, it builds a type list, a function list, and an IR (intermediate representation) graph. The IR graph has nodes that represent variable definitions and assignments, as well as all the operations supported by the compiler such as +, -, etc.
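A minimal sketch of such a tokenizing pass (token names are illustrative, not Nutella's actual enums): split the source text into identifiers, numbers, and single character punctuation.

```cpp
#include <cctype>
#include <string>
#include <vector>

enum token_type { Token_Identifier, Token_Number, Token_Punctuation };

struct token
{
    token_type Type;
    std::string Text;
};

std::vector<token> Tokenize(const std::string& Source)
{
    std::vector<token> Result;
    size_t I = 0;
    while (I < Source.size())
    {
        char C = Source[I];
        if (isspace((unsigned char)C)) { ++I; continue; }

        if (isalpha((unsigned char)C) || C == '_')
        {
            // Identifier: letters, digits, underscores
            size_t Start = I;
            while (I < Source.size() &&
                   (isalnum((unsigned char)Source[I]) || Source[I] == '_')) { ++I; }
            Result.push_back({ Token_Identifier, Source.substr(Start, I - Start) });
        }
        else if (isdigit((unsigned char)C))
        {
            // Number literal: a run of digits
            size_t Start = I;
            while (I < Source.size() && isdigit((unsigned char)Source[I])) { ++I; }
            Result.push_back({ Token_Number, Source.substr(Start, I - Start) });
        }
        else
        {
            // Everything else becomes single character punctuation
            Result.push_back({ Token_Punctuation, std::string(1, C) });
            ++I;
        }
    }
    return Result;
}
```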

Intermediate Representation:

The IR graph handles if statements by storing a node for the if statement itself, a node that points to the subgraph to execute if the if statement is taken, as well as a node pointing to the first line of code outside of the if statement. While loops are stored in a similar fashion. This method of storage is useful for the compiler backend, because the backend knows where to insert jump instructions and which lines they have to jump to. It is also useful for the compiler's compile time code execution feature.
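A sketch of how such an if node might be laid out (field and enum names are illustrative, not Nutella's): the if node carries a pointer to the taken subgraph and a pointer to the first node after the statement, so the backend knows exactly where the conditional jump targets are.

```cpp
#include <cstddef>

enum ir_node_type { IrNode_Assign, IrNode_Add, IrNode_If, IrNode_While };

struct ir_node
{
    ir_node_type Type;
    ir_node* ThenBlock; // subgraph to execute when the condition is true
    ir_node* Next;      // first node after the if/while statement
};

// Walking the layout: visit the linear chain, recursing into taken blocks
int CountNodes(ir_node* Node)
{
    int Count = 0;
    for (; Node; Node = Node->Next)
    {
        ++Count;
        if (Node->Type == IrNode_If || Node->Type == IrNode_While)
        {
            Count += CountNodes(Node->ThenBlock);
        }
    }
    return Count;
}
```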


Compile Time Code Execution:

Compile time code execution was inspired by Jonathan Blow's language, where you can put a #run statement next to any statement of code, and that statement will be executed at compile time. This avoids the need for things like template metaprogramming when we want to precompute constants, and it can be extended to build look up tables and do static code analysis. In Nutella, the #run directive works the same way; it supports precomputation of constants in the code as well as function calls.

x86 Backend:

After the IR graph is built, the program executes any #run statements through an interpreter that knows how to execute the IR graph. Once the resulting constants are placed into the code, we invoke the backend to convert the IR to x86 assembly.

The backend steps through the graph and calculates the stack offset of each local variable in the program. Afterwards, it steps through the graph again and converts every node to the x86 assembly instructions that represent it. The backend tries to keep as few registers in use as possible, so for every instruction it loads the needed values from the stack into registers, operates on them, and then pushes the result back to the stack.
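The load-operate-store pattern for a binary add could be sketched like this (the register choices and exact mnemonics are illustrative, not necessarily what Nutella emits): both operands come from the stack at known ebp-relative offsets, and the result goes back to the stack.

```cpp
#include <string>

// Emit x86 text for "result = left + right" where all three live on the stack
std::string EmitAdd(int LeftOffset, int RightOffset, int ResultOffset)
{
    std::string Asm;
    Asm += "mov eax, [ebp-" + std::to_string(LeftOffset) + "]\n";
    Asm += "mov ebx, [ebp-" + std::to_string(RightOffset) + "]\n";
    Asm += "add eax, ebx\n";
    Asm += "mov [ebp-" + std::to_string(ResultOffset) + "], eax\n";
    return Asm;
}
```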

Working on this project was one of the coolest things I've ever coded. It opened my eyes to how code optimization occurs and the kind of information interpreters have to process, and it also taught me a lot about parsing strings. Building the compiler from scratch was also useful because I got to see how a compiler has to handle various error cases and output the correct error messages. If you'd like to check out the code for this project, it can be found at https://github.com/Ihaa21/nutella_compiler.

Software Triangle Rasterizer (Kiwi)


Kiwi is a software triangle rasterizer I built in C++. The rasterizer supports textured triangles with depth interpolation, and a top left filling rule. The code for this project can be found at https://github.com/Ihaa21/Kiwi-Software-Rasterizer/tree/master/code. The Kiwi renderer works in 2 stages:

Vertex Processing:

The renderer begins with a vertex shader stage where every vertex gets transformed into pixel coordinates. At this stage, data such as Z and UV are set up to be interpolated during rasterization. The vertices are also sorted into counter clockwise winding order.

Rasterization and Fragment Processing:

Afterwards, Kiwi loops through every triangle and performs triangle discard tests as well as screen clipping. It calculates a minimal bounding box around the triangle and begins rasterizing. Once the program encounters a pixel that is within the triangle, Kiwi performs a depth test, interpolates the triangle's Z and UV values, samples the triangle's texture, and renders the pixel onto the screen. The results of this method can be seen below:
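The pixel-inside-triangle test is commonly done with edge functions; here is a sketch (Kiwi's actual implementation may differ): a pixel is inside a counter clockwise triangle when it lies on the positive side of all three edges.

```cpp
struct v2 { float x, y; };

// Signed area of the parallelogram spanned by (A->B, A->P); positive means
// P is to the left of the edge for counter clockwise winding
float EdgeFunction(v2 A, v2 B, v2 P)
{
    return (B.x - A.x) * (P.y - A.y) - (B.y - A.y) * (P.x - A.x);
}

bool InsideTriangle(v2 A, v2 B, v2 C, v2 P)
{
    return EdgeFunction(A, B, P) >= 0.0f &&
           EdgeFunction(B, C, P) >= 0.0f &&
           EdgeFunction(C, A, P) >= 0.0f;
}
```

The same edge function values, normalized by the triangle's area, give the barycentric weights used to interpolate Z and UV at each covered pixel.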




Independent Video Game (Rosemary)

Rosemary is a tower defense game I've been working on independently for iOS and Android. The game is written completely in C++ and uses no external libraries. There are over 50 different types of unique monsters, each with their own abilities and behaviors. I've written every component of Rosemary myself and have been coding/designing the game on my own for almost a year. Below, I go over how the systems work and some of the problems I faced during development:

Debugging/Profiling

Rosemary has a built in profiler that measures how long function calls take and graphs them relative to the frame's duration. The profiler logs function timings via calls to TIMED_FUNC() at the beginning of a function definition. After the timings are collected, they are processed and transformed to be displayed on screen. Below is an example of the profiler's output:
The multi colored blocks represent the time span a function took to run (the function name can be seen by hovering over the block). The top blocks are the outermost measured function calls; the lower blocks are function calls made from inside the higher blocks. The black bar near the end of the screen marks the 16.6ms target; if we pass that bar, the game is running slower than the target frame rate. Rosemary stores 256 frames of debug data, and by clicking on the frame bar at the top of the screen we can pause profiling to view the function call times for a different frame. Rosemary can also display a Top Clock function list, which lists the functions that took the most CPU time to run. 
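A TIMED_FUNC style macro can be built with an RAII timer; here is a stripped down sketch (Rosemary's real profiler records into per frame debug memory, while this version just appends to a global vector):

```cpp
#include <chrono>
#include <vector>

struct timing_record
{
    const char* Name;
    double Ms;
};

static std::vector<timing_record> GlobalTimings;

struct scoped_timer
{
    const char* Name;
    std::chrono::steady_clock::time_point Start;

    scoped_timer(const char* InName)
        : Name(InName), Start(std::chrono::steady_clock::now()) {}

    ~scoped_timer()
    {
        // Record the elapsed time when the function scope exits
        auto End = std::chrono::steady_clock::now();
        double Ms = std::chrono::duration<double, std::milli>(End - Start).count();
        GlobalTimings.push_back({ Name, Ms });
    }
};

// __func__ expands to the enclosing function's name at the use site
#define TIMED_FUNC() scoped_timer TimedFunc_(__func__)

void UpdateEntities()
{
    TIMED_FUNC();
    // ... game work would happen here ...
}
```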
Run time Compilation

Rosemary supports run time compilation of C++ code. This is achieved by having a platform specific DLL (Win32, Android, etc) and a game DLL. The platform DLL stores only the platform specific code that the game needs to run (creating a window, talking to the sound driver, etc). The game DLL stores all the game code, from rendering to AI. When I recompile the game DLL, the platform layer notices that the game DLL has changed, swaps the old DLL in memory with the new one, and updates any function pointers that it relies on. Doing so allows me to recompile my game at run time and see the changes occur in real time:
Path finding

Since Rosemary is a tower defense where the player builds mazes, I need a system that determines where the monsters have to go to get across the maze. To do so, the game divides the world into tiles labelled as empty or taken, and performs a breadth first search to figure out how monsters should cross the maze. Unfortunately, a regular BFS causes monsters to sometimes walk in "L" shaped motions when they should be walking diagonally. To solve this, I had the algorithm alternate which neighbors get added to the BFS queue depending on the position of the cell in the world. This gave a zigzag movement that looked a lot more natural and avoided more damage from the towers.
To make the system even more visually pleasing, I wanted to get rid of the zigzagging and have the monsters move diagonally when possible. Once a monster reaches a tile and is searching for where to go next, the game finds the next tile and the tile after that. It then shoots a ray cast to see if moving directly to the farther tile is allowed, and if so, the monster moves in that direction. This got rid of the zigzag motion and made the monsters' movement look a lot more appealing.
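The BFS pass over the tile grid can be sketched like this (not Rosemary's exact code): starting from the goal tile, compute every walkable tile's distance to the goal, so monsters can then step to any neighbor with a smaller distance.

```cpp
#include <queue>
#include <vector>

// Walls[i] != 0 means tile i is taken; returns -1 for unreachable tiles
std::vector<int> ComputeDistances(const std::vector<int>& Walls,
                                  int Width, int Height, int GoalX, int GoalY)
{
    std::vector<int> Dist(Width * Height, -1);
    std::queue<int> Queue;
    Dist[GoalY * Width + GoalX] = 0;
    Queue.push(GoalY * Width + GoalX);

    int DX[4] = { 1, -1, 0, 0 };
    int DY[4] = { 0, 0, 1, -1 };
    while (!Queue.empty())
    {
        int Cell = Queue.front(); Queue.pop();
        int X = Cell % Width, Y = Cell / Width;
        for (int I = 0; I < 4; ++I)
        {
            int NX = X + DX[I], NY = Y + DY[I];
            if (NX < 0 || NX >= Width || NY < 0 || NY >= Height) { continue; }
            int Next = NY * Width + NX;
            if (Walls[Next] || Dist[Next] != -1) { continue; }
            Dist[Next] = Dist[Cell] + 1;
            Queue.push(Next);
        }
    }
    return Dist;
}
```

Alternating the order of the DX/DY neighbor tables per cell, as described above, is what produces the natural looking zigzag instead of "L" shaped paths.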
Sound Mixing

The game loads WAV sound files and stores stereo sample buffers in the asset file for loading at runtime. The platform layer notifies the game of the number of samples it should write and how many samples have elapsed since the previous frame. The game then loops through all the sounds that are currently playing and mixes their samples together into the output buffer. Before mixing, the game applies any pitch changes that the sound's metadata requests. The game usually writes 2 extra seconds of sound to prevent audio cutting out during any dropped frames.
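The core of the mixing loop can be sketched as follows (Rosemary's real mixer also handles pitch changes and per sound playback positions): sum each playing sound's samples into the output buffer, clamping to the 16 bit range so loud mixes don't wrap around.

```cpp
#include <cstdint>
#include <cstddef>

void MixInto(int16_t* Out, const int16_t* Sound, size_t SampleCount)
{
    for (size_t I = 0; I < SampleCount; ++I)
    {
        // Accumulate in 32 bits, then clamp back to the int16 range
        int32_t Mixed = (int32_t)Out[I] + (int32_t)Sound[I];
        if (Mixed > 32767) { Mixed = 32767; }
        if (Mixed < -32768) { Mixed = -32768; }
        Out[I] = (int16_t)Mixed;
    }
}
```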

Entity System

Towers are non moving and all the same size; they have AI that looks for targets to attack and can have more than one ability to cast. Projectiles always move towards a target and have logic to handle what happens when the target dies and what to do on impact. Projectiles are usually created by abilities that a monster or tower is casting, so they have different AI states to handle those abilities. An example of this is a spike tower that lays spikes within a radius around it. The spikes themselves are projectiles in an inactive state, so when a monster gets near one, it takes damage and the spike is destroyed.
Monsters are the most complex entities in the game, with various types of AI such as flying, walking, swimming and burrowing. They can have multiple abilities and therefore use similar ability casting AI functions as towers. There is also a meta entity called a monster group. This is for bosses that are made up of multiple monsters but still function as a single entity. The Centipede monster, for example, is made up of multiple smaller monsters chained together in a line. The monster group stores pointers to all the monsters it controls and has a separate update loop for them. It sets their state so that the regular monster AI loop correctly simulates the monsters according to the monster group's needs.
Memory Management

Rosemary is built to only ever perform one allocation from the OS in the entire program's life cycle. At startup, the game requests a large block of memory from the OS. Once it receives that memory, the game partitions it so that every system has access to as much memory as it needs. For regular allocations, Rosemary mostly uses linear allocators and free list allocators. For example, I use a free list allocator for all the entities in the game, but I use a linear allocator for partitioning the game memory across systems and for temporary single frame allocations.
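A linear allocator of the kind described above can be sketched in a few lines (names are illustrative): bump a pointer forward for each allocation, never free individually, and reset the whole arena in one step.

```cpp
#include <cstddef>
#include <cstdint>

struct linear_arena
{
    uint8_t* Base;
    size_t Size;
    size_t Used;
};

void* PushSize(linear_arena* Arena, size_t RequestSize)
{
    if (Arena->Used + RequestSize > Arena->Size)
    {
        return nullptr; // out of arena memory
    }
    void* Result = Arena->Base + Arena->Used;
    Arena->Used += RequestSize;
    return Result;
}

void ResetArena(linear_arena* Arena)
{
    // All allocations from this arena become invalid in one step,
    // which is exactly what a per-frame scratch arena wants
    Arena->Used = 0;
}
```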

For asset memory management, every level requests a set of assets from the asset file that are used in that level, and upon level start up, those assets are loaded. Once we quit out of the level, the assets memory is marked as usable and overwritten by the next level that we play. This way, the game only needs enough asset memory for a single level, which helps with cutting down memory usage.

Rendering

Rosemary has an OpenGL/OpenGL ES renderer that supports instancing, SRGB correct images, and premultiplied alpha. The renderer has a command buffer that the game logic populates as it simulates a frame. This buffer contains every command that the game pushes to render; an example of a command is rendering a sprite. While receiving commands, the command buffer batches as many commands together as it can based on the texture used, to lower driver overhead later when rendering.