Wednesday, 7 July 2021

Optimizing Grid Simulations with SIMD

In games we sometimes run into problems in entity simulations that are O(n^2), such as dealing damage in a radius, collision, pathing behavior, etc. We can speed these problems up by using some kind of acceleration structure such as grids, BVHs, quad/octrees, etc, making the average cost O(n log(n)) instead. When we use these acceleration structures, we inherently get more branchy code, and it becomes harder to use SIMD to gain performance benefits. I set out to write a small demo which uses a grid to solve an O(n^2) problem, but also gets good SIMD utilization in the process. The problem I chose was to simulate boids (https://eater.net/boids, Boids Video). 

The post will first show my naive implementation of boids, followed by how I optimized it with a grid data structure, and then how I used SIMD to speed up the grid approach, with some final thoughts afterwards. 

Boids

This post won't go into a lot of detail about how exactly boids work (the above link has more info + source), but the main idea is that each entity has a random starting position + velocity. The entity checks a radius around itself, and applies 3 different rules to update its current velocity:

1) Try to align its velocity with the average velocity of entities around it

2) Try to move towards the center position of the entities around it

3) Avoid hitting other entities that are in its path

Boids are used to simulate things like flocks of birds or schools of fish. When implementing them, I double buffer the entities' position and velocity so that when we apply the rules, we aren't reading position and velocity data that is mixed between the last and current frames. The code ends up looking like the following:

struct bird
{
    v2 Vel;
    v2 Pos;
};

...

static b32 EvenFrame = true;

bird* PrevBirdArray = EvenFrame ? DemoState->PrevBirds : DemoState->CurrBirds;
bird* CurrBirdArray = EvenFrame ? DemoState->CurrBirds : DemoState->PrevBirds;
EvenFrame = !EvenFrame;

// NOTE: Update boids
for (u32 BoidId = 0; BoidId < DemoState->NumBoids; ++BoidId)
{
    v2 OldBoidPosition = PrevBirdArray[BoidId].Pos;
    v2 OldBoidVelocity = PrevBirdArray[BoidId].Vel;
    v2 NewBoidPosition = OldBoidPosition;
    v2 NewBoidVelocity = OldBoidVelocity;

    u32 NumBoidsInRadius = 0;
    v2 AvgFlockDir = {};
    v2 AvgFlockPos = {};
    v2 AvgFlockAvoidance = {};
    for (u32 NearbyBoidId = 0; NearbyBoidId < DemoState->NumBoids; ++NearbyBoidId)
    {
        if (NearbyBoidId != BoidId)
        {
            v2 DistanceVec = PrevBirdArray[NearbyBoidId].Pos - NewBoidPosition;
            f32 DistanceSq = LengthSquared(DistanceVec);

            if (DistanceSq < DemoState->BoidRadiusSq)
            {
                // NOTE: Calculate required averages
                ...
            }
        }
    }

    if (NumBoidsInRadius)
    {
        AvgFlockDir /= f32(NumBoidsInRadius);
        AvgFlockPos /= f32(NumBoidsInRadius);
    }
                    
    // NOTE: Apply rules
    ...
                    
    NewBoidPosition += NewBoidVelocity * FrameTime;
                    
    CurrBirdArray[BoidId].Vel = NewBoidVelocity;
    CurrBirdArray[BoidId].Pos = NewBoidPosition;
}

If we don't use an acceleration structure, these rules make us run in O(n^2) because for each entity, we have to loop through all other entities, check if they are in radius, and apply the rules. The main observation is that we don't actually care about most other entities if they are roughly evenly distributed across the map. Knowing this, we can use an acceleration structure to group entities based on their location and avoid having to check most other entities when applying our boid rules.

For reference, on my AMD Ryzen 7 2700X 3.70 GHz CPU, the above code was able to get to around 400 entities at real time frame rates (this is all single core and will continue being single core for the rest of the post).

Grid Simulation

For this demo, I chose to use a grid since it seemed like a nice and easy fit for the problem (all entities are the same size and I'm expecting a fairly even distribution). I decided to rebuild the grid every frame instead of trying to update the grid from last frame. The reason was that an entity can be in more than one grid cell at a time, so either the entity has to store references to the 4 potential grid cells it's inside + the index into each grid cell's list of entities, or I would have to find the 4 grid cells based on the entity's position and iterate across all the indices to find where in the list the entity reference is in order to remove it. Both options seemed a lot less clean than what I wanted, and they are non-issues if the grid is rebuilt every frame. 

The grid API is the following:

struct grid_range
{
    u32 StartX;
    u32 StartY;
    u32 EndX;
    u32 EndY;
};

struct grid_cell
{
    u32 NumIndices;
    block_arena IndexArena;
};

struct grid
{
    aabb2 WorldBounds;
    u32 NumCellsX;
    u32 NumCellsY;
    u32 MaxNumIndicesPerBlock;
    grid_cell* Cells;
};

inline grid GridCreate(linear_arena* Arena, platform_block_arena* BlockArena, aabb2 WorldBounds, u32 NumCellsX, u32 NumCellsY);
inline u32 GridAddEntity(grid* Grid, v2 Position, u32 EntityId);
inline grid_range GridGetRange(grid* Grid, v2 Pos, f32 Radius);
inline void GridClear(grid* Grid);

From the structs, each grid cell stores some kind of array/list of entity indices that are inside of its world bounds, and a grid is just a collection of all the cells + some metadata. The grid API is probably what you would expect, but I do have some special allocators that I'm using. I have a linear allocator which I use to allocate the grid_cell struct array, but I also have what I call platform_block_arena and block_arena.

The block_arena is an arena which just allocates fixed-size blocks of memory and chains them in a linked list. The arena itself does not, however, call the OS for blocks of memory. Instead, it suballocates out of bigger blocks that are allocated by platform_block_arena. The platform_block_arena allocates larger blocks that are given by the OS, and then various different block arenas can suballocate out of it. I do this setup to try and minimize the number of OS calls to allocate memory (in practice, it's also nice because multiple systems can suballocate out of the same platform_block_arena). 

Looking at the grid cell index arena, it only allocates fixed-size blocks of memory, so what it ends up being is a linked list of fixed-size arrays of 32-bit unsigned integers that are our indices in that cell. Doing this sort of linked-list-of-arrays setup is nice because it gives you the benefit of having a resizable data structure, while still being contiguous (and thus cache friendly) up to your block size. You instead pay in terms of wasted memory, since the block arena can only allocate fixed-size blocks. If memory is a concern, you can try and tune the fixed block size to have less dead space. 
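
To make the layout concrete, here is a rough sketch of how the block chaining could look. The real block_arena/platform_block_arena in the demo have more bookkeeping; the member names and the BlockGetData macro here are assumptions based on how they are used later in the post.

// NOTE: Hypothetical sketch of the block chaining described above, not the exact implementation
struct block
{
    block* Next;
    // NOTE: The block's payload (e.g. the u32 indices of a grid cell) lives right after this header
};

struct block_arena
{
    platform_block_arena* PlatformArena; // NOTE: Where the fixed-size blocks are suballocated from
    u32 BlockSize;                       // NOTE: Size of each block, including the header
    block* Next;                         // NOTE: Head of the chain of allocated blocks
};

// NOTE: Grab the payload that follows a block header (matches how BlockGetData is used later)
#define BlockGetData(Block, type) ((type*)((u8*)(Block) + sizeof(block)))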

The code follows in 3 stages now: 

1) Add all birds/entities to the grid

2) Update all birds/entities in the grid

3) Clear the grid

// NOTE: Add all birds to grid data structure
u32* BirdGridIds = PushArray(&DemoState->TempArena, u32, DemoState->NumBirds);
{
    for (u32 BirdId = 0; BirdId < DemoState->NumBirds; ++BirdId)
    {
        bird* CurrBird = PrevBirdArray + BirdId;
        BirdGridIds[BirdId] = GridAddEntity(&DemoState->Grid, CurrBird->Pos, BirdId);
    }
}

We start off step 1 by allocating an array of grid cell ids, one per entity, in a temporary arena that will be freed later. We then add all the entities to our grid.
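
GridAddEntity itself just maps the position to a cell and appends the entity index to that cell's block arena. Below is a rough sketch of what it might look like; it assumes aabb2 stores Min/Max and that a BlockArenaPush helper exists for the index arena, so treat it as an illustration rather than the exact demo code.

// NOTE: Rough sketch of GridAddEntity, not the exact implementation
inline u32 GridAddEntity(grid* Grid, v2 Position, u32 EntityId)
{
    // NOTE: Convert the world position into a clamped cell coordinate
    v2 Dim = Grid->WorldBounds.Max - Grid->WorldBounds.Min;
    u32 CellX = Min(u32(((Position.x - Grid->WorldBounds.Min.x) / Dim.x) * Grid->NumCellsX), Grid->NumCellsX - 1);
    u32 CellY = Min(u32(((Position.y - Grid->WorldBounds.Min.y) / Dim.y) * Grid->NumCellsY), Grid->NumCellsY - 1);
    u32 CellId = CellY * Grid->NumCellsX + CellX;

    // NOTE: Append the entity index to the cell's index list (allocating a new block if needed)
    grid_cell* Cell = Grid->Cells + CellId;
    u32* StoredId = (u32*)BlockArenaPush(&Cell->IndexArena, sizeof(u32)); // NOTE: BlockArenaPush is assumed
    *StoredId = EntityId;
    Cell->NumIndices += 1;

    return CellId;
}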

// NOTE: Update birds
grid* Grid = &DemoState->Grid;
for (u32 BirdId = 0; BirdId < DemoState->NumBirds; ++BirdId)
{
    bird CurrBird = PrevBirdArray[BirdId];
                    
    v2 NewBirdPosition = CurrBird.Pos;
    v2 NewBirdVelocity = CurrBird.Vel;

    u32 NumBirdsInRadius = 0;
    v2 AvgFlockDir = {};
    v2 AvgFlockPos = {};
    v2 AvgFlockAvoidance = {};

    grid_range Range = GridGetRange(&DemoState->Grid, NewBirdPosition, Max(DemoState->BirdRadiusSq, DemoState->AvoidRadiusSq));
    for (i32 GridY = Range.StartY; GridY <= Range.EndY; ++GridY)
    {
        for (i32 GridX = Range.StartX; GridX <= Range.EndX; ++GridX)
        {
            grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

            // NOTE: Loop over all birds in the grid cell
            u32 GlobalIndexId = 0;
            for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
            {
                u32* BlockIndices = BlockGetData(CurrBlock, u32);
                u32 NumIndicesInBlock = Min(CurrCell->NumIndices - GlobalIndexId, Grid->MaxNumIndicesPerBlock);
                for (u32 IndexId = 0; IndexId < NumIndicesInBlock; ++IndexId, ++GlobalIndexId)
                {
                    u32 NearbyBirdId = BlockIndices[IndexId];
                                    
                    if (NearbyBirdId != BirdId)
                    {
                        bird* NearbyBird = PrevBirdArray + NearbyBirdId;

                        // NOTE: Find values we need for rules
                        ...
                     }
                }
            }
        }
    }
    
    if (NumBirdsInRadius)
    {
        AvgFlockDir /= f32(NumBirdsInRadius);
        AvgFlockPos /= f32(NumBirdsInRadius);
    }
                        
    // NOTE: Apply rules
    ...

    NewBirdPosition += NewBirdVelocity * FrameTime;

    // NOTE: Write into next bird array
    CurrBirdArray[BirdId].Vel = NewBirdVelocity;
    CurrBirdArray[BirdId].Pos = NewBirdPosition;
}

GridClear(Grid);

For updating each entity, we find the grid cells that could contain entities within our specified radius and loop through them, getting the values we require for our 3 boid rules. You'll notice that because we have a linked list of arrays, the looping mechanism looks a bit more complex: it first steps through each block, and then through each entity in that block's array. We then clear the grid at the end, which frees all the memory that each block_arena held, but does not clear the memory used by platform_block_arena, so we aren't actually giving memory back to the OS (the memory ends up being reused the next time we build our grid).
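
For completeness, GridGetRange just converts a query position plus radius into a clamped range of cell coordinates. A rough sketch under the same assumptions as before (aabb2 storing Min/Max; not the exact demo code):

// NOTE: Rough sketch of GridGetRange, not the exact implementation
inline grid_range GridGetRange(grid* Grid, v2 Pos, f32 Radius)
{
    grid_range Result = {};

    v2 Dim = Grid->WorldBounds.Max - Grid->WorldBounds.Min;
    f32 CellDimX = Dim.x / f32(Grid->NumCellsX);
    f32 CellDimY = Dim.y / f32(Grid->NumCellsY);

    // NOTE: Clamp the query box to the grid bounds before converting to cell coordinates
    f32 MinX = Clamp(Pos.x - Radius, Grid->WorldBounds.Min.x, Grid->WorldBounds.Max.x);
    f32 MinY = Clamp(Pos.y - Radius, Grid->WorldBounds.Min.y, Grid->WorldBounds.Max.y);
    f32 MaxX = Clamp(Pos.x + Radius, Grid->WorldBounds.Min.x, Grid->WorldBounds.Max.x);
    f32 MaxY = Clamp(Pos.y + Radius, Grid->WorldBounds.Min.y, Grid->WorldBounds.Max.y);

    // NOTE: Start/End are inclusive, matching the <= loops in the update code
    Result.StartX = Min(u32((MinX - Grid->WorldBounds.Min.x) / CellDimX), Grid->NumCellsX - 1);
    Result.StartY = Min(u32((MinY - Grid->WorldBounds.Min.y) / CellDimY), Grid->NumCellsY - 1);
    Result.EndX = Min(u32((MaxX - Grid->WorldBounds.Min.x) / CellDimX), Grid->NumCellsX - 1);
    Result.EndY = Min(u32((MaxY - Grid->WorldBounds.Min.y) / CellDimY), Grid->NumCellsY - 1);

    return Result;
}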

Using the grid setup, I was able to get to 5-6 thousand entities at real time frame rates.

A Small SIMD Primer


The idea behind SIMD is that we often run into algorithms that need to do the same operations on different pieces of data (hence, Single Instruction Multiple Data). For example, in image processing we often want to apply the same algorithm to each pixel, like when we apply Gaussian blurs or other filters. The CPU exposes instructions that let us perform the same operation on multiple pieces of data at a faster rate. For example, we can add 4 float values in the same time it takes to add 1 float value (a 4x speed up). The way SIMD is exposed on x64 CPUs is via the SSE and AVX instruction sets (you can read more about them at https://software.intel.com/sites/landingpage/IntrinsicsGuide/). For this post, I'm limiting myself to the pre-AVX SSE extensions for simplicity. 

The SSE instruction set gives us 2 new data types that we will be using, __m128 and __m128i. Both are 128-bit vectors, but one is a float register while the other is an integer register. We can pack 4 32-bit float values into one __m128 register, or 4 32-bit integer values into one __m128i register. After filling in values, we can operate on these vectors using regular operations like addition, subtraction and multiplication, but each operation will do 4 of those at a time. So if we fill one register with the values 0, 1, 2, 3 and the other with 4, 5, 6, 7, then a call to _mm_add_epi32 (add 32-bit integers) will give us a vector with the values 4, 6, 8, 10 (each lane of the vector operates independently of the others). 

When loading values into registers, we generally have 3 options:

1) We can call _mm_set_ps/_mm_set_epi32 which lets us specify the 4 values directly
2) We can call _mm_load_ps/_mm_load_si128 which will do an aligned load from memory. The requirements here are that the passed pointer is aligned to 16 bytes, and that the 4 values are contiguous in memory.
3) We can call _mm_loadu_ps/_mm_loadu_si128 which behave as the above, but without the alignment requirement. On current hardware, the performance of the 2 calls is the same except in some edge cases (if the load crosses a cache line boundary), but we can ignore that for this post since it won't come up.

We have similar options for storing (_mm_store_ps/_mm_storeu_ps work like their load counterparts). We want to use the load/store calls provided by the instruction set since they are the quickest way to get 128-bit registers to and from memory. The requirement this imposes on us is that we have to try and keep our data contiguous in memory as much as possible.
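
As a tiny standalone illustration of the above (not demo code, just the raw intrinsics), here is the integer example from earlier plus a float version that goes through memory:

#include <emmintrin.h> // NOTE: SSE2 intrinsics (also pulls in the SSE float ops)

// NOTE: The integer example from above: 0,1,2,3 + 4,5,6,7 = 4,6,8,10
__m128i IntA = _mm_set_epi32(3, 2, 1, 0);  // NOTE: _mm_set_* takes values from high lane to low lane
__m128i IntB = _mm_set_epi32(7, 6, 5, 4);
__m128i IntSum = _mm_add_epi32(IntA, IntB);

// NOTE: The same idea with floats, loading/storing through memory instead
float In[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
float Out[4];
__m128 FloatA = _mm_loadu_ps(In);                 // NOTE: Unaligned load of 4 contiguous floats
__m128 FloatB = _mm_set1_ps(4.0f);                // NOTE: Broadcast 4.0f to all 4 lanes
_mm_storeu_ps(Out, _mm_add_ps(FloatA, FloatB));   // NOTE: Out is now { 4, 5, 6, 7 }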

SIMDifying Grids

We already have a working grid and we understand how it's stored in memory, so now we have to think about how we can apply SIMD to optimize it. The main bottleneck is the entity updating, so we will focus on that step (the other steps take a minimal amount of computation in comparison). SIMD likes working on multiple entities at a time (for each loop, compute 4 entities at a time instead of 1), so I like to use the following structs in my code:

union v1u_x4
{
    struct
    {
        __m128i x;
    };
        
    u32 e[4];
};

union v1i_x4
{
    struct
    {
        __m128i x;
    };
    
    i32 e[4];
};
 
union v1_x4
{
    struct
    {
        __m128 x;
    };
    
    f32 e[4];
};

union v2_x4
{
    struct
    {
        v1_x4 x, y;
    };

    v1_x4 e[2];
};
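
To make these types pleasant to use, the usual math operators get overloaded to map straight onto the SSE intrinsics. Below is a minimal sketch of a few of them (the demo has a fuller set; the v2_x4 operators just apply the v1_x4 ones per component):

// NOTE: Sketch of a few operator overloads, not the full set used in the demo
inline v1_x4 operator+(v1_x4 A, v1_x4 B)
{
    v1_x4 Result;
    Result.x = _mm_add_ps(A.x, B.x); // NOTE: 4 float adds in one instruction
    return Result;
}

inline v1_x4 operator*(v1_x4 A, v1_x4 B)
{
    v1_x4 Result;
    Result.x = _mm_mul_ps(A.x, B.x); // NOTE: 4 float multiplies in one instruction
    return Result;
}

inline v2_x4 operator+(v2_x4 A, v2_x4 B)
{
    v2_x4 Result;
    Result.x = A.x + B.x;
    Result.y = A.y + B.y;
    return Result;
}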

Using the above lets me more easily write code that works on multiple entities at a time, instead of trying to SIMDify the update of a single entity. This also happens to more closely mimic what GPU execution looks like. If you've written shader code before, you'll probably know that each time you write a float add or vec add, it's actually happening for 32/64 elements at a time, but the programming model lets you write the code as if it's only happening for one. Below is an example of how converting scalar code to SIMD code using my vector types ends up looking:

v2 NewBirdVelocity;
v2 NewBirdPosition;
v2 AvgFlockPos;
v2 AvgFlockDir;

// NOTE: Fly towards center
NewBirdVelocity += DemoState->MoveToFlockWeight * (AvgFlockPos - NewBirdPosition);
// NOTE: Avoid Others
NewBirdVelocity += DemoState->AvoidBirdWeight * AvgFlockAvoidance;
// NOTE: Align Velocities
NewBirdVelocity += DemoState->AlignFlockWeight * (AvgFlockDir - NewBirdVelocity);
                        
// NOTE: Clamp Velocity
f32 BirdSpeed = Clamp(Length(NewBirdVelocity), DemoState->MinSpeed, DemoState->MaxSpeed);
NewBirdVelocity = BirdSpeed * Normalize(NewBirdVelocity);
// NOTE: Avoid Terrain
NewBirdVelocity += DemoState->AvoidTerrainWeight * AvoidWallDir;
// NOTE: Update Position
NewBirdPosition += NewBirdVelocity * FrameTime;

=> 

v2_x4 NewBirdVelocity;
v2_x4 NewBirdPosition;
v2_x4 AvgFlockPos;
v2_x4 AvgFlockDir;

// NOTE: Fly towards center
NewBirdVelocity += DemoState->MoveToFlockWeight * (AvgFlockPos - NewBirdPosition);
// NOTE: Avoid Others
NewBirdVelocity += DemoState->AvoidBirdWeight * AverageData.AvgFlockAvoidance;
// NOTE: Align Velocities
NewBirdVelocity += DemoState->AlignFlockWeight * (AvgFlockDir - NewBirdVelocity);
                        
// NOTE: Clamp Velocity
v1_x4 BirdSpeed = Clamp(Length(NewBirdVelocity), V1X4(DemoState->MinSpeed), V1X4(DemoState->MaxSpeed));
NewBirdVelocity = BirdSpeed * Normalize(NewBirdVelocity);
// NOTE: Avoid Terrain
NewBirdVelocity += DemoState->AvoidTerrainWeight * AvoidWallDir;
// NOTE: Update Position
NewBirdPosition += NewBirdVelocity * FrameTime;

Since I'm designing the code to operate on 4 entities at a time, it's usually enough to just rename the types in the computation code and you get the parallelization for free (you still have to figure out how to load the correct values into the SIMD registers and store them afterwards, which we will go over below). The above SIMD code does the same thing as the scalar code, but it computes it for 4 entities at a time.

The part where CPU SIMD gets more complicated than GPU code is in memory operations. GPUs support arbitrary gather/scatter while CPUs do not (at least for SSE). On CPUs, the only things we have in SSE are aligned and unaligned loads/stores. We can however mimic gather/scatter operations by doing the memory operations in scalar code, but that is much slower than the SSE loads/stores, so we want to avoid it as much as possible. Below is what an emulated gather looks like:

inline v1_x4 V1X4(f32 X, f32 Y, f32 Z, f32 W)
{
    v1_x4 Result = {};
    Result.x = _mm_set_ps(W, Z, Y, X);
    return Result;
}

inline v1_x4 V1X4Gather(f32* X, f32* Y, f32* Z, f32* W, v1u_x4 Mask)
{
    v1_x4 Result = V1X4(Mask.e[0] ? *X : 0, Mask.e[1] ? *Y : 0, Mask.e[2] ? *Z : 0, Mask.e[3] ? *W : 0);
    return Result;
}
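
The plain unaligned load/store wrappers that show up later (V1UX4LoadUnAligned, V2X4LoadUnAligned, StoreUnAligned) are just thin wrappers over the intrinsics from the primer. A sketch of what they might look like (not the exact demo code):

// NOTE: Sketch of the unaligned load/store wrappers used below, not the exact implementation
inline v1u_x4 V1UX4LoadUnAligned(u32* Ptr)
{
    v1u_x4 Result = {};
    Result.x = _mm_loadu_si128((__m128i*)Ptr); // NOTE: Unaligned load of 4 contiguous u32s
    return Result;
}

inline v1_x4 V1X4LoadUnAligned(f32* Ptr)
{
    v1_x4 Result = {};
    Result.x = _mm_loadu_ps(Ptr); // NOTE: Unaligned load of 4 contiguous floats
    return Result;
}

inline v2_x4 V2X4LoadUnAligned(f32* PtrX, f32* PtrY)
{
    v2_x4 Result = {};
    Result.x = V1X4LoadUnAligned(PtrX);
    Result.y = V1X4LoadUnAligned(PtrY);
    return Result;
}

inline void StoreUnAligned(v1_x4 V, f32* Ptr)
{
    _mm_storeu_ps(Ptr, V.x); // NOTE: Unaligned store of 4 contiguous floats
}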

With grids, we are accessing some random list of entities stored in each cell that can change from frame to frame, so we would expect that over time it would get more and more incoherent as entities randomly move around. If we just take the code we have as is and SIMDify it, we basically have to use gather/scatter if we want to load 4 entities at a time, which reverts to scalar code and is much less efficient.

The solution I came up with was a bit of a lucky accident. I noticed that I'm already double buffering all my entities, and the grid is recreated every frame, so instead of writing the entity at index 0 from the read buffer to index 0 in the write buffer, I'm going to write it to a memory location that is nearby the other entities in its cell. Below is some scalar pseudocode to illustrate the idea:

// NOTE: Loop over all grids and all entities in the grids
u32 GlobalIndexId = 0;
for (u32 GridY = 0; GridY < Grid->NumCellsY; ++GridY)
{
    for (u32 GridX = 0; GridX < Grid->NumCellsX; ++GridX)
    {
        grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

        u32 BirdBlockIndexId = 0;
        for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
        {
            u32* BlockIndices = BlockGetData(CurrBlock, u32);
            u32 NumIndicesInBlock = Min(CurrCell->NumIndices - BirdBlockIndexId, Grid->MaxNumIndicesPerBlock);
            for (u32 IndexId = 0; IndexId < NumIndicesInBlock; ++IndexId)
            {
                u32 CurrBirdId = BlockIndices[IndexId];

                // NOTE: Load bird data
                v2 NewBirdPosition = PrevBirdArray[CurrBirdId].Pos;
                v2 NewBirdVelocity = PrevBirdArray[CurrBirdId].Vel;

                // NOTE: Get average data
                ...
                    
                // NOTE: Apply rules
                ...
                    
                NewBirdPosition += NewBirdVelocity * FrameTime;
                
                CurrBirdArray[GlobalIndexId].Vel = NewBirdVelocity;
                CurrBirdArray[GlobalIndexId].Pos = NewBirdPosition;
                GlobalIndexId += 1;
            }

            BirdBlockIndexId += NumIndicesInBlock;
        }
    }
}

The first thing you might notice is that I'm not iterating my array of entities element by element; instead I'm iterating grid cell by grid cell. This also lets me avoid storing a temporary array of grid cell ids for each entity. The other addition is that I'm maintaining a global write index as I write my entities from the read array into the write array. This means that I write out the entities in traversal order, which happens to be grid cell by grid cell, so for the most part, the entities in each grid cell will sit in contiguous blocks of memory, which is exactly what SIMD likes! When I SIMDify this code, I can check if the 4 indices I load are sequential, and if they are, do a regular load; otherwise, do a gather for the edge cases, which are not very common. This ends up being the following:

struct bird_array
{
    f32* PosX;
    f32* PosY;
    f32* VelX;
    f32* VelY;
};

inline u32 MoveMask(v1u_x4 V)
{
    // NOTE: Outputs 1 bit per 32-bit element, set if that element's most significant bit is set
    u32 Result = 0;
    Result = _mm_movemask_epi8(V.x);
    // NOTE: _mm_movemask_epi8 returns 1 bit per byte, so compact bits 3, 7, 11, 15 into a 4 bit mask
    // (equivalently, _mm_movemask_ps(_mm_castsi128_ps(V.x)) produces this 4 bit mask directly)
    Result = ((Result >> 3) & 0x1) | ((Result >> 6) & 0x2) | ((Result >> 9) & 0x4) | ((Result >> 12) & 0x8);

    return Result;
}

...

// NOTE: Update birds
u32 GlobalIndexId = 0;
for (u32 GridY = 0; GridY < Grid->NumCellsY; ++GridY)
{
    for (u32 GridX = 0; GridX < Grid->NumCellsX; ++GridX)
    {
        grid_cell* CurrCell = Grid->Cells + GridY * Grid->NumCellsX + GridX;

        u32 BirdBlockIndexId = 0;
        for (block* CurrBlock = CurrCell->IndexArena.Next; CurrBlock; CurrBlock = CurrBlock->Next)
        {
            u32* BlockIndices = BlockGetData(CurrBlock, u32);
            u32 NumIndicesInBlock = Min(CurrCell->NumIndices - BirdBlockIndexId, Grid->MaxNumIndicesPerBlock);
            for (u32 IndexId = 0; IndexId < NumIndicesInBlock; IndexId += 4)
            {
                v1u_x4 CurrBirdId = V1UX4LoadUnAligned(BlockIndices + IndexId);
                u32 FirstBirdId = CurrBirdId.e[0];

                // NOTE: Check if we can do a load or if we have to gather
                v1u_x4 BirdValidMask = (V1UX4(IndexId) + V1UX4(0, 1, 2, 3)) < V1UX4(NumIndicesInBlock);
                v1u_x4 AlignedMask = {};
                {
                    v1u_x4 FirstBirdVec = V1UX4(FirstBirdId);
                    AlignedMask = (CurrBirdId - FirstBirdVec) == V1UX4(0, 1, 2, 3);
                }

                // NOTE: Load bird data
                v2_x4 NewBirdPosition = {};
                v2_x4 NewBirdVelocity = {};
                {
                    if (MoveMask(BirdValidMask) == 0xF && MoveMask(AlignedMask) == 0xF)
                    {
                        // NOTE: The 4 indices are sequential so we can load them with regular loads
                        NewBirdPosition = V2X4LoadUnAligned(PrevBirdArray.PosX + FirstBirdId, PrevBirdArray.PosY + FirstBirdId);
                        NewBirdVelocity = V2X4LoadUnAligned(PrevBirdArray.VelX + FirstBirdId, PrevBirdArray.VelY + FirstBirdId);
                    }
                    else
                    {
                        // NOTE: We have to do a masked gather here
                        NewBirdPosition = V2X4Gather(PrevBirdArray.PosX, PrevBirdArray.PosY, CurrBirdId, BirdValidMask);
                        NewBirdVelocity = V2X4Gather(PrevBirdArray.VelX, PrevBirdArray.VelY, CurrBirdId, BirdValidMask);
                    }
                }
                                                                                
                bird_average_data AverageData = GridGetAverageData(Grid, PrevBirdArray, NewBirdPosition,
                                                                   CurrBirdId, BirdValidMask);
                                        
                // NOTE: Apply rules
                ...
                    
                NewBirdPosition += NewBirdVelocity * FrameTime;
                                            
                // NOTE: Write into next bird array (we do unaligned stores since sometimes we store less than
                // 4 elements and then the next write won't be aligned)
                StoreUnAligned(NewBirdVelocity.x, CurrBirdArray.VelX + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdVelocity.y, CurrBirdArray.VelY + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdPosition.x, CurrBirdArray.PosX + GlobalIndexId + IndexId);
                StoreUnAligned(NewBirdPosition.y, CurrBirdArray.PosY + GlobalIndexId + IndexId);
            }

            BirdBlockIndexId += NumIndicesInBlock;
            GlobalIndexId += NumIndicesInBlock;
        }
    }
}

There are a few points to make about the above code: 

1) I rewrote the entity storage code to be SoA (struct of arrays) so that we can easily load our entity attributes using loads/stores and not have to do any shuffling. The arrays are also padded to the size of the SSE vector registers. This way, when I have a v2_x4, it's loading 4 sequential x values into one __m128 register, and 4 sequential y values into another __m128 register (everything is cleanly separated; a sketch of the padded allocation follows this list).

2) We can run into cases where a given cell has, for example, 3 entities, so when we try to load 4, one will be invalid and we need to track it and ignore its results. We track this with a vector mask (BirdValidMask) generated by a SIMD compare, and convert it to a scalar bitmask via MoveMask when we need to branch on it. 

3) We also use a MoveMask to figure out if we have sequential indices and can use an unaligned load, or if we have to go into the slow path and do a gather.

4) When we store our entities, we always store all 4 of them, even if only 1, 2, or 3 of them are valid. The reason this works out and we don't later read garbage data is that we advance GlobalIndexId by only the number of valid entities, so the next loop iteration will overwrite the garbage ones we stored.
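
For point 1, the main requirement is just that each attribute array is contiguous and padded up to a multiple of the SIMD width. A rough sketch of the padded SoA allocation (PushArray mirrors its earlier usage, but DemoState->Arena and the exact arena setup here are assumptions, not the exact demo code):

// NOTE: Sketch of allocating the SoA bird arrays, padded to a multiple of 4 lanes
u32 PaddedNumBirds = (DemoState->NumBirds + 3) & ~3u; // NOTE: Round up to a multiple of 4

bird_array CurrBirdArray = {};
CurrBirdArray.PosX = PushArray(&DemoState->Arena, f32, PaddedNumBirds);
CurrBirdArray.PosY = PushArray(&DemoState->Arena, f32, PaddedNumBirds);
CurrBirdArray.VelX = PushArray(&DemoState->Arena, f32, PaddedNumBirds);
CurrBirdArray.VelY = PushArray(&DemoState->Arena, f32, PaddedNumBirds);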

The other big change with SIMD is that all if statements that were used for computation have to be replaced with masks that cancel out unwanted values. 

v2 DistanceVec = NearbyBird->Pos - NewBirdPosition;
f32 DistanceSq = LengthSquared(DistanceVec);

if (DistanceSq < DemoState->BirdRadiusSq)
{
    NumBirdsInRadius += 1;
                                
    // NOTE: Velocity Matching
    AvgFlockDir += NearbyBird->Vel;

    // NOTE: Bird Flocking
    AvgFlockPos += NearbyBird->Pos;
}
if (DistanceSq < DemoState->AvoidRadiusSq)
{
    // NOTE: Avoidance
    AvgFlockAvoidance += -DistanceVec;
}

=>

v2 DistanceVec = NearbyBird->Pos - NewBirdPosition;
f32 DistanceSq = LengthSquared(DistanceVec);

bool BirdRadiusMask = DistanceSq < DemoState->BirdRadiusSq;
NumBirdsInRadius += BirdRadiusMask;
                                
// NOTE: Velocity Matching
AvgFlockDir += BirdRadiusMask * NearbyBird->Vel;

// NOTE: Bird Flocking
AvgFlockPos += BirdRadiusMask * NearbyBird->Pos;

bool AvoidRadiusMask = DistanceSq < DemoState->AvoidRadiusSq;
AvgFlockAvoidance += AvoidRadiusMask * -DistanceVec;

This is also something that GPU shader code does for you for free, without you having to manually add it to your code. If the condition is false, the bool mask will be 0, so all the computation it's multiplied by will be nullified; otherwise it will be equal to 1 and the result is taken as is. 

For GridGetAverageData, it basically works the same as before, stepping through the neighboring cells one entity at a time, but we check that 1 entity against all 4 of the entities we previously loaded in each instruction. 
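
To make that concrete, below is a rough sketch of what the masked accumulation inside GridGetAverageData could look like with raw SSE calls, broadcasting one nearby bird against the 4 birds being updated. The accumulator names mirror the earlier code but as x4 types (NumBirdsInRadius is kept as a float v1_x4 here for simplicity), the self-check and valid-lane masking are omitted for brevity, and this is not the exact demo code.

// NOTE: Sketch of masking with SSE, one nearby bird checked against 4 birds at once
__m128 NearbyPosX = _mm_set1_ps(PrevBirdArray.PosX[NearbyBirdId]); // NOTE: Broadcast one value to all 4 lanes
__m128 NearbyPosY = _mm_set1_ps(PrevBirdArray.PosY[NearbyBirdId]);

// NOTE: DistanceVec = nearby bird position - the 4 current bird positions, per lane
__m128 DistX = _mm_sub_ps(NearbyPosX, NewBirdPosition.x.x);
__m128 DistY = _mm_sub_ps(NearbyPosY, NewBirdPosition.y.x);
__m128 DistanceSq = _mm_add_ps(_mm_mul_ps(DistX, DistX), _mm_mul_ps(DistY, DistY));

// NOTE: Lanes where the nearby bird is in radius become all 1 bits, others become all 0 bits
__m128 InRadiusMask = _mm_cmplt_ps(DistanceSq, _mm_set1_ps(DemoState->BirdRadiusSq));

// NOTE: ANDing with the mask zeroes out the contribution for out-of-radius lanes
__m128 NearbyVelX = _mm_set1_ps(PrevBirdArray.VelX[NearbyBirdId]);
__m128 NearbyVelY = _mm_set1_ps(PrevBirdArray.VelY[NearbyBirdId]);
AvgFlockDir.x.x = _mm_add_ps(AvgFlockDir.x.x, _mm_and_ps(InRadiusMask, NearbyVelX));
AvgFlockDir.y.x = _mm_add_ps(AvgFlockDir.y.x, _mm_and_ps(InRadiusMask, NearbyVelY));

// NOTE: Count how many of the 4 birds saw this neighbor in radius (adds 1.0f where the mask is set)
NumBirdsInRadius.x = _mm_add_ps(NumBirdsInRadius.x, _mm_and_ps(InRadiusMask, _mm_set1_ps(1.0f)));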

With SIMD, I was able to run between 10k and 20k entities at real time frame rates. It does depend heavily on how you set your rules so that you don't get too many entities in a single grid cell; sometimes 40k entities were running at real time frame rates. 

Final Thoughts

I was pleasantly surprised by how nice and clean this solution ended up being, taking advantage of both a grid structure and SIMD optimizations. There is still probably a good amount of room for optimizing this code as it currently stands: 

1) I think because I double buffered, I can actually remove the gather in cases where an entity isn't referenced in more than one grid cell, since we are guaranteed coherent reads from last frame. Instead of a gather, it would have to behave as a masked load, so just ignore the extra values that get read and make sure our arrays are padded to avoid reading past allocations.

2) When I get the block average data, I compare 4 entities against 1, but it might be better to compare 4 against 4. It would take a bit more effort but this might end up being faster to do.

3) The code can be multithreaded. Each thread can work on a grid cell, or maybe a block of a grid cell. You would have to be more careful with how threads update and store entities, but as is, it's very possible to do.

4) Right now I walk the grid left to right, bottom to top, but there might be a swizzled pattern (e.g. a Morton/Z-order walk) that's better, since nearby entities might already be in cache due to GridGetAverageData reading nearby grid cells.

5) For the specific problem of boids, I can try to treat it more as a fluid sim and write velocity/position values into textures, then do bilinear filtering to get the average data instead of walking nearby entities for each entity. This would change the sim output but it would still approximate it well. I didn't want to do this because I wanted the solution I came up with to also apply to other problems like pathfinding and collision, but it would probably be much faster if we are just concerned with boids.

6) Since we already store entities in a grid, we can try to store entity positions as 8/16-bit fixed point values relative to their grid cell. You would have to make sure that the precision is good enough, but it can help reduce memory traffic if that is a bottleneck.

etc..

I think you can extend this approach to other acceleration structures or other game problems, but there is an important gotcha that would need to be taken care of. The main idea behind my optimization is to write entities out in a different order than the one they were previously in. This would break any system that needs to store references to these entities, since their position in the array is not stable. There are a few ways around this that I don't think would nullify this approach:

1) Generally we already have a handle system in place for storing references to entities. After we are done moving entities around, we can have each entity remember which slot it has in the handle system, and update that slot's index to the entity's new index in the array. This would require moving all the other entity data to the new indices as well, which might be inefficient.

2) Do a similar approach as above but have the handle slots store 2 indexes, one for game logic data and one for sim data. This way we still update the sim data index but we avoid having to copy around all the other entity game logic data.

This kind of optimization also appears in different contexts. I recently started trying to write a 2D graph visualizer, and you run into similar problems where each node is affected in some way by nearby nodes, and you want to use an acceleration structure to speed that up. In that case I'm writing it on the GPU, so maybe I'll write a future blog post about that. 

For now, you can find some code for this demo here: https://github.com/Ihaa21/Boids. The demo might have some todo's and things that I haven't gotten around to, but you can use it as a reference for how I went about optimizing this algorithm.