Wednesday, 13 June 2018

Sparse Voxel Octree Renderer (Fluffy)

Fluffy is a realtime CPU sparse voxel octree renderer which is coded in C++ using AVX instructions. The renderer supports real time rendering of arbitrarily oriented octree's with per pixel detail on a 512 by 512 screen using a single core. I used Intel VTune amplifier to profile and improve the code for this project. The code for this project can be found at https://github.com/Ihaa21/Fluffy. Below I will be detailing the process of optimizing this project from 800ms per frame to 20ms per frame.

Rendering Algorithm:

Fluffy is a attempt at building a rendering algorithm where instead of the run time being decided by model complexity, it is decided by the size of the screen. This is done by hierarchically traversing models stored as octrees, and occluding as many nodes and their children as possible.

The renderer begins by projecting the root node to the screen and checking if the node is occluded. If it isn't we check if the node is of a single pixels size. If it is, we render it, otherwise we split it into its 8 children, and we recursively apply the above algorithm in a front to back order. This way, we are only traversing nodes which are visible, and if a node isn't visible, it and its children are all rejected, preventing us from doing needless checks. The above algorithm's run time is mostly dependent on the size of the screen instead of the geometry since it only traverses nodes which are visible, and renders nodes as soon as they are of a single pixel size. The pseudo code can be found below:

void RenderOctree(v3 NodeCenter, f32 NodeRadius, octree* Node)
{
    i32 MinX, MaxX, MinY, MaxY;
    ProjectNodesCorners(NodeCenter, NodeRadius, &MinX, &MaxX, &MinY, &MaxY);

    MinX = ClipMinX(MinX);
    MaxX = ClipMinX(MaxX);
    MinY = ClipMinX(MinY);
    MaxY = ClipMinX(MaxY);

    if (IsNodeOccluded(MinX, MaxX, MinY, MaxY))
    {
        return;
    }

    f32 ChildRadius = NodeRadius / 2.0f;
    v3 ChildrenCenters[8] = GetChildrenCenters(NodeCenter, NodeRadius);

    // NOTE: Sort children centers such that we traverse them front to back
    f32 Keys[8] =
    {
        Length(ChildrenCenters[0]),
        Length(ChildrenCenters[1]),
        Length(ChildrenCenters[2]),
        Length(ChildrenCenters[3]),
        Length(ChildrenCenters[4]),
        Length(ChildrenCenters[5]),
        Length(ChildrenCenters[6]),
        Length(ChildrenCenters[7]),
    };
    u32 Order = {0, 1, 2, 3, 4, 5, 6, 7};
    SortLeastToGreatest(Keys, Order);

    for (u32 ChildId = 0; ChildId < 8; ++ChildId)
    {
        u32 SortedId = Order[ChildId];

        RenderOctree(ChildrenCenters[SortedId], ChildRadius, Node->Children[SortedId]);
    }
}

I visualized how many recursions the algorithm above needs to perform for every pixel and we can see the results below (different colors represent different number of recursions):

The picture on the left is when I used a float (not rounded) min/max values to decide if a node is small enough to render. The picture on the right is when I used the rounded integer min/max values to decide if a node is small enough to render. We can see that there are lots of nodes in the right picture which need fewer recursions than the picture on the left.

Algorithm Optimizations

Quicker projection calculation

In the above renderer pseudo code, we treat each octree node as a cube and project its 8 corners onto the screen to calculate the min/max bounds that the cube takes up on screen. Doing so is a very slow operation, and had to be executed millions of times per frame. To speed this process up, I pretend every oriented octree node is a actually the smallest screen facing cube that encapsulates that node. I would then calculate which corners of this cube correspond to the min/max values of the nodes projected space.
In certain orientations, the corners which correspond to the min/max values change as can be seen in the images below:
To speed up cube projection, I came up with 2 algorithms. Algorithm 1 was to project the min/max corners of the front and back face, and then calculate min/max values from the 4 projected points. This halved the amount of corners I had to project. Algorithm 2 projected only 2 x,y value pairs but would add extra if statements to figure out which corners x,y values corresponded to the projected cubes min/max values. In the above image for example, the center cube has 3 points which correspond to the cubes min/max values on screen. But, the top left corner only needs its y value projected to calculate the max y, and the top right corner only needs its x value projected to calculate the max x. The algorithm would identify such cases depending on which faces of the cube are visible on screen. Reviewing benchmarks below, we can see that algorithm 2 turned out to be the faster algorithm.

MeanAverageStDevMinMax
Scalar211.883003212.75650424.124100585205.675797230.36496
Algorithm 175.26851775.540828731.31218409173.54123784.518555
Algorithm 274.27075274.431776880.98439302872.26805985.476723

Minimizing Sorting:


Given that the program traverses millions of octree nodes a frame, it can't feasibly sort every nodes children to be ordered front to back. At the point I decided to examine this issue, the sorting was taking roughly 25% of the programs run time. If a octree node is rotated, the program finds the largest encapsulating box that is facing the camera, and renders that box instead. When drawing these nodes, I realized most of the time, the sorted order of the parent is the same as the child's, but I also identified cases where that isn't true shown in the images below:

I simplified the above example to a 2d case instead of 3d, but the same principles apply. We can see that the parent node on the left has 4 nodes, ABCD. Nodes A, B and nodes C, D are both the same distance away from the camera so to traverse our nodes front to back, we can traverse A->B->C->D or we can traverse A->B->D->C or any other permutation where we flip A with B or C with D. If we then examine the children of B, we see that node 0 is strictly closer to the camera than node 1, and node 2 is strictly closer than node 3. So there is only one possible order. This is an example of a order changing from the parent to the children.

I realized later that the only times that the traversal order for a parent changes for the children is when the faces of the parent which are visible are different than the faces of the children. This change of visible faces only happens when the parent node intersects the x or y axis, while the child node does not as can be seen in the below picture:

Therefore, I only have to sort nodes when they intersect the x or y axis, otherwise I can propogate the parents traversal order onto the children. When profiling, almost 100% of our nodes don't fall into this category and we don't even spend 1% of our programs runtime sorting nodes anymore.

Orthographic Mode:

The last optimization is to try to render as many nodes using orthographic projection as possible. When performing perspective projection, we have to store world positions of nodes and perform divisions to project nodes to the screen. This means we use up more memory and have to wait for long latency instructions which can stall our CPU pipelines. So Fluffy checks if a nodes backface projects to the same pixels as the front face, which means that the node is so small that the effects of perspective projection are no longer visible. This is shown in the below images:

Fluffy checks if nodes are ready for orthographic projection by checking if the top right corners of the front and back faces of the cube project to the same pixel. If they do, it switches to orthographic mode and subdivides the node in screen space by finding midpoints between every edge as shown in the below image:

Below are the performance results of applying this method.


MeanAverageStDevMinMax
Scalar175.455917174.889116219.78129413127.600182217.117813
Ortho1148.587189149.605005216.38707136107.362137182.3797

SIMD Optimizations

Processing 8 nodes at a time

Originally, I tried to apply SIMD on the original algorithm which was outlined above, which meant Fluffy was processing one node at a time. This resulted in lots of horizontal operations like finding the min/max value in a AVX vector. I decided to instead process 8 nodes at a time doing as much of it in SIMD as possible, then switching to scalar for things like checking node occlusion. This helped make the code more parallel and easier to implement in SIMD.

Screen as 8x8 blocks

The algorithm touches many pixels to render the octree to the screen. I knew that every row of pixels touched by a node would result in a cache miss once we had to jump to the next row of pixels. I tried to group pixels into 8x8 blocks, each being a 1 bit coverage mask instead of a 32 bit depth value so that I could lower the amount of cache misses in coverage buffer access. This however incurs a performance hit to generate masks whenever we loop per pixel since sometimes our projected cube partially covers a 8x8 block of pixels. The performance results can be seen below:
MeanAverageStDevMinMax
Scalar Occlusion50.59556250.441757199.01711967423.43388768.161186
8x8 Occlusion61.35755261.0286472410.9266814127.88140183.063866

Surprisingly, this method made the performance worse because of the extra instructions it took to generate the pixel masks. In the end, I decided to use a 8 bit coverage mask with a scalar loop to check for node occlusion to avoid any extra operations to generate masks, while shrinking my rendered buffer by a factor of 4 (from 32 bit depth).