Virtualized Geometry Systems: Architecture, Theory, and Implementation

May 1, 2026 · 41 min read

Founder of Graphical Playground

A Principal Engineer's Guide to Production-Grade Micro-Polygon Rendering

"The geometry problem is not one of bandwidth alone, it is a problem of indirection, coherence, and the fundamental mismatch between the triangle and the pixel."

Abstract

Modern game engines face an intractable triangle throughput ceiling: GPU rasterizers are designed around polygons covering several pixels, yet cinematic assets routinely produce triangles smaller than a single pixel at runtime distances. Classical Level-of-Detail (LOD) pipelines fail at this scale due to popping artifacts, authoring overhead, and the inability to leverage the full spatial resolution of sculpted meshes. This paper covers Virtualized Geometry Systems (VGS), the family of techniques that includes Epic Games' Nanite (Unreal Engine 5), Ubisoft's micropolygon pipeline, and related academic precursors such as Reyes, Progressive Meshes, and Streaming Mesh systems. Topics include cluster-based hierarchical level-of-detail, screen-space error metrics, BVH/DAG visibility culling, software rasterization fallback paths, virtual shadow maps, and GPU-driven draw dispatch. Included is a concrete implementation blueprint targeting a C++23/Vulkan 1.3 engine, with pseudocode, data layout specifications, and complexity analyses.

Keywords: virtualized geometry, LOD, BVH, cluster hierarchy, software rasterization, meshlet, screen-space error, GPU-driven rendering, visibility buffer, nanite

Introduction and motivation
Background and related work
Theoretical foundations
Cluster hierarchy construction
Run-time visibility determination
Rasterization pipeline
Material evaluation and deferred shading
Streaming and virtualization
Virtual shadow maps
Implementation architecture
Performance analysis and benchmarks
Advanced topics (state of the art, 2026)
References

1. Introduction and motivation

1.1 The polygon-pixel mismatch

The fundamental unit of GPU rasterization is the triangle. A triangle covers a contiguous set of pixels, and the rasterizer emits one or more quads (2×2 pixel blocks) per triangle to guarantee correct partial-derivative computation for texture sampling. This design carries an implicit assumption: triangles should cover multiple pixels.

Let a mesh $M$ have $N_T$ triangles. At camera distance $d$ , the projected area of a single triangle in screen space is, on average:

$\bar{A}_{px} = \frac{A_{ws} \cdot \cot^2(\theta_{fov}/2)}{d^2 \cdot W_r \cdot H_r}$

where $A_{ws}$ is the average world-space triangle area, $\theta_{fov}$ is the vertical field of view, and $(W_r, H_r)$ is the render resolution. When $\bar{A}_{px} \ll 1$ , the rasterizer is forced to test entire quads (4 pixels) to shade a fraction of one pixel, the classical overdraw-vs-underdraw inefficiency known as quad overshading.

For a typical film-quality asset with $N_T = 10^7$ triangles rendered at $d = 10$ meters:

$\bar{A}_{px} \approx 0.003 \text{ px}^2$

This is three orders of magnitude below the break-even threshold. The GPU wastes approximately 99.7% of fragment shader invocations on invisible or sub-pixel geometry. The Nanite whitepaper [Karis 2021] quantifies this: the naive pipeline achieves roughly 1 triangle per 20 cycles under these conditions, whereas the theoretical peak is 1 triangle per 2 cycles.

1.2 Limitations of classical LOD

The traditional solution is Discrete LOD (DLOD): artists hand-author 4–6 mesh resolutions, and the engine selects one per object per frame. This fails for the modern production pipeline in four fundamental ways:

Authoring cost scales linearly with the number of unique mesh assets, which now numbers in the tens of thousands for open-world titles.
Popping artifacts occur at LOD transition boundaries, requiring expensive cross-fade blending that still fails under motion.
Memory redundancy: each LOD copy stores topology independently, inflating VRAM usage by a factor of $\approx 1.3$ – $1.8\times$ .
Granularity: DLOD operates per-object, ignoring the fact that different regions of a single mesh may require different resolutions simultaneously (e.g., a building visible both up close and in a distant reflection).

Continuous LOD (CLOD) schemes such as Progressive Meshes [Hoppe 1996] address authoring and popping at the cost of GPU-unfriendly sequential edge-collapse dependencies. They do not parallelize.

1.3 Thesis

We argue that the correct solution decouples geometry storage from geometry rendering via a four-layer architecture:

[Offline] Cluster Hierarchy Build → Streaming Asset
    ↓
[Load-time] Hierarchical BVH Construction
    ↓
[Frame] GPU-Driven Cluster Selection (screen-space error)
    ↓
[Frame] Software + Hardware Rasterization of selected clusters

Each layer is independently scalable, cache-friendly, and GPU-parallelizable.

2.1 Reyes and micropolygon rendering

The Reyes rendering architecture [Cook et al. 1987] introduced the concept of micropolygons: sub-pixel-sized quads generated by dicing parametric surfaces. Reyes dices surfaces until each patch produces quads of exactly 0.5 pixel in screen space, achieving theoretically optimal triangle density. Reyes is offline only, as dicing is CPU-bound and generates geometry on-the-fly. Our work brings this ideal to real-time.

2.2 Progressive Meshes

Hoppe's Progressive Meshes [1996] represent a mesh as a base coarse mesh plus a sequence of vertex splits $V = \{v_0, v_1, \ldots, v_k\}$ . Any prefix of $V$ produces a valid mesh approximation. This gives a continuous, view-independent LOD stream, but the sequential nature of vertex splits prevents GPU parallelism. The dependency graph of vertex splits is a DAG with depth $O(N_T)$ , preventing batch execution.

2.3 View-dependent LOD

Xia and Varshney [1996] and Hoppe [1997] introduced view-dependent LOD that adapts per-region. A multi-resolution mesh stores vertex splits annotated with screen-space error bounds. At runtime, the active front of splits is advanced until error bounds are met everywhere. This achieves per-region adaptation but still runs on the CPU and cannot stream.

2.4 Streaming LOD

Losasso and Hoppe [2004] and later Cignoni et al. [2004] (Batched Multi-Triangulation) introduced streaming mesh representations that load geometry on demand. The core insight, that LOD levels can be stored as independent clusters with known error bounds, directly underpins Nanite's design.

2.5 Meshlets and task/mesh shaders

NVIDIA's introduction of Task/Mesh Shaders (Turing, 2018) and the subsequent Vulkan VK_NV_mesh_shader / VK_EXT_mesh_shader extension provided the GPU programming model that makes GPU-side cluster selection tractable. A meshlet [Wihlidal 2018, Strugar 2021] is a fixed-size cluster of up to 64/128 triangles with a precomputed bounding cone for backface culling. Mesh shaders allow a compute-like amplification stage that selects which meshlets to rasterize per-thread-group.

2.6 Nanite (Unreal Engine 5)

Epic's Nanite [Karis 2021] is the production system that combined these ideas at scale. Contributions:

A hierarchical cluster DAG built offline, with LOD selection granularity finer than any previous system.
Software rasterization of small triangles via a compute shader path, avoiding hardware rasterizer setup overhead.
Persistent threads and a two-pass GPU culling architecture.
Integration with Virtual Shadow Maps for high-quality shadow rendering.

2.7 Ubisoft's micropolygon pipeline

Ubisoft's Rainbow Six Siege and Avatar: Frontiers of Pandora teams developed independent micropolygon systems [Wihlidal 2015, Méndez-Feliu 2023]. Their approach emphasizes:

Displacement map dicing at render time rather than offline baking.
Parametric surface evaluation on the GPU, generating geometry in compute shaders.
Integration with their proprietary Snowdrop engine's material system.

3. Theoretical foundations

3.1 Screen-space error metric

The central decision in any LOD system is: at what point is a simplified representation indistinguishable from the original? We define the geometric error of a cluster $C$ relative to its parent cluster $P$ as:

$\varepsilon(C) = \max_{p \in P} d(p, \hat{P}_C)$

where $\hat{P}_C$ is the closest-point projection of the parent surface onto the simplified cluster surface, and $d(\cdot, \cdot)$ is Euclidean distance. This is the Hausdorff distance between the two surfaces:

$\varepsilon(C) = d_H(C, P) = \max\left( \sup_{x \in C} \inf_{y \in P} \|x-y\|, \sup_{y \in P} \inf_{x \in C} \|x-y\| \right)$

The projected screen-space error at camera distance $d$ is:

$\varepsilon_{px}(C, d) = \varepsilon(C) \cdot \frac{\cot(\theta_{fov}/2)}{d} \cdot \frac{H_r}{2}$

This is the number of pixels by which the simplified cluster could differ from ground truth at distance $d$ . We define a threshold $\varepsilon_{max}$ (typically 1.0 pixel) and select the coarsest cluster $C^*$ such that:

$\varepsilon_{px}(C^*, d) \leq \varepsilon_{max}$

Equivalently, we compute the minimum viewing distance at which cluster $C$ is acceptable:

$d_{min}(C) = \varepsilon(C) \cdot \frac{\cot(\theta_{fov}/2) \cdot H_r}{2 \cdot \varepsilon_{max}}$

This distance is precomputed offline and stored per cluster as a single float. At runtime, the LOD selection reduces to a comparison: render cluster $C$ if $d > d_{min}(C)$ .

3.2 Cluster invariant

For a valid cluster hierarchy $\mathcal{H}$ , the following invariant must hold for any node $C$ with parent $P(C)$ :

$\varepsilon(P(C)) \geq \varepsilon(C)$

and:

$\text{Bounds}(P(C)) \supseteq \text{Bounds}(C)$

The first condition ensures monotone error growth up the hierarchy (coarser = more error). The second ensures spatial containment. These two properties guarantee that the LOD selection produces a cut through the hierarchy, a set of non-overlapping clusters that together cover the entire surface exactly once at the chosen resolution.

Lemma (Cut Correctness): Let $\mathcal{F} = \{C \in \mathcal{H} : \varepsilon_{px}(C, d) \leq \varepsilon_{max}$ and $\varepsilon_{px}(P(C), d) > \varepsilon_{max}\}$ be the set of clusters selected by the LOD criterion. If the hierarchy invariant holds, then $\mathcal{F}$ is a partition of the original surface, and $\max_{C \in \mathcal{F}} \varepsilon_{px}(C, d) \leq \varepsilon_{max}$ .

Proof sketch: By induction on hierarchy depth. The root has $\varepsilon(root) > \varepsilon_{max}$ for all $d$ below a threshold. The leaves are the original clusters with $\varepsilon = 0$ . The selection criterion picks exactly the frontier between acceptable and unacceptable, which is a valid cut by the monotonicity invariant. $\square$

3.3 Cluster size and cache coherence

A cluster stores $N_{tri}$ triangles. The optimal $N_{tri}$ balances:

Culling granularity: smaller clusters → finer culling → less overdraw
Culling overhead: smaller clusters → more BVH nodes → more culling work
Cache line utilization: GPU L1 cache lines are 128 bytes; vertex data should fill an integer number of cache lines

For a vertex storing position (12B), normal (12B), UV (8B), and tangent (8B) = 40 bytes, a cache line holds 3.2 vertices. A cluster of $N_v$ vertices spans:

$L_{cache}(N_v) = \left\lceil \frac{40 N_v}{128} \right\rceil \text{ cache lines}$

NVIDIA documentation recommends $N_{tri} \leq 124$ , $N_v \leq 64$ [Wihlidal 2018]. The Nanite system uses $N_{tri} = 128$ , $N_v = 128$ per cluster, noting that larger clusters amortize per-cluster overhead better for their software rasterizer.

3.4 Spatial hashing and Morton codes

During hierarchy traversal, we need to quickly locate the cluster bounding a query point. We use Morton codes (Z-order curve) to map 3D cluster centroids to a 1D key that preserves spatial locality:

$\text{Morton}(x, y, z) = \text{interleave}(\lfloor x/\delta \rfloor, \lfloor y/\delta \rfloor, \lfloor z/\delta \rfloor)$

For a 64-bit Morton code with 21 bits per axis, the interleaving function is:

$M(i) = \sum_{k=0}^{20} ((i \gg k) \& 1) \ll 3k$

This can be computed in $O(1)$ via bit-manipulation (the classic "magic bits" technique):

constexpr uint64_t morton_expand3(uint32_t v) noexcept {
    uint64_t x = v & 0x1fffff;
    x = (x | x << 32) & 0x1f00000000ffff;
    x = (x | x << 16) & 0x1f0000ff0000ff;
    x = (x | x <<  8) & 0x100f00f00f00f00f;
    x = (x | x <<  4) & 0x10c30c30c30c30c3;
    x = (x | x <<  2) & 0x1249249249249249;
    return x;
}

constexpr uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) noexcept {
    return morton_expand3(x) | (morton_expand3(y) << 1) | (morton_expand3(z) << 2);
}

Sorting clusters by Morton code produces a spatially coherent ordering that maximizes BVH build efficiency and GPU cache hit rate during traversal.

4. Cluster hierarchy construction

4.1 Overview

The cluster hierarchy is built offline as part of the asset pipeline. The input is a full-resolution triangle mesh $M$ with $N_T$ triangles. The output is a tree of cluster groups $\mathcal{G} = \{G_0, G_1, \ldots, G_k\}$ stored in a compact binary format. The build pipeline has four stages:

Stage 1: Base Clustering     M → {C_0, ..., C_n}   (meshlet decomposition)
Stage 2: Group Formation     Clusters → Groups       (spatial grouping)  
Stage 3: Group Simplification Groups → Simplified    (edge collapse within group)
Stage 4: Hierarchy Assembly  All levels → DAG        (parent pointers + error)

4.2 Stage 1: Base clustering (meshlet decomposition)

We partition the input mesh into meshlets, clusters of at most $N_{tri}^{max} = 128$ triangles sharing a connected patch of surface. The objective is to minimize the surface area of each cluster's bounding volume while respecting the triangle count limit.

Algorithm (Greedy Meshlet Builder):

function BuildMeshlets(M, N_tri_max, N_vert_max):
    mark all triangles as unassigned
    while unassigned triangles exist:
        seed = pick_unassigned_triangle()
        C = new Cluster(seed)
        frontier = adjacency_ring(seed)
        while frontier not empty and |C| < N_tri_max:
            t = frontier.pop_best(C)  // minimize AABB growth
            if |C.vertices ∪ verts(t)| <= N_vert_max:
                C.add(t)
                frontier.push(adjacency_ring(t) \ C)
        emit(C)
    return all clusters

The pop_best function selects the frontier triangle that minimizes the increase in the cluster's axis-aligned bounding box. This greedy heuristic achieves tight bounding volumes in $O(N_T \log N_T)$ time.

Formal AABB cost:

$\text{cost}(t, C) = \text{Vol}(\text{AABB}(C \cup \{t\})) - \text{Vol}(\text{AABB}(C))$

An alternative cost that empirically produces better culling efficiency is the surface area heuristic (SAH):

$\text{cost}_{SAH}(t, C) = \frac{\text{SA}(\text{AABB}(C \cup \{t\}))}{\text{SA}(\text{AABB}(C))} - 1$

4.3 Stage 2: Group formation

To enable hierarchical simplification, we group adjacent clusters into cluster groups of size $G_{size}$ (typically 4–8 clusters). A group $G_i$ represents the atomic unit of LOD transition: all clusters in a group are replaced simultaneously by their parent.

Grouping uses a graph partitioning approach. Construct a graph $\mathcal{A}$ where:

Nodes are clusters $\{C_0, \ldots, C_n\}$
Edge $(C_i, C_j)$ has weight equal to the number of shared boundary edges

We seek a partition of $\mathcal{A}$ into groups of size $\approx G_{size}$ that minimizes the total weight of cut edges. This is a variant of the minimum $k$ -cut problem, which is NP-hard in general but well-approximated by METIS [Karypis and Kumar 1998]:

$\min_{\mathcal{P}} \sum_{(i,j) \in \text{cut}(\mathcal{P})} w_{ij}$

subject to $|\mathcal{P}_k| \leq G_{size} \cdot (1 + \delta)$ for all parts $k$ , where $\delta = 0.05$ is a balance tolerance.

Minimizing cut edges directly minimizes the number of boundary edges shared between groups. These boundary edges must not move during simplification to preserve watertightness across group boundaries, they form the locked boundary.

4.4 Stage 3: Group simplification

Within each group $G_i$ , we simplify the interior triangles using Quadric Error Metrics (QEM) [Garland and Heckbert 1997], keeping boundary edges fixed.

Quadric Error Definition:

For a vertex $v$ , let $\mathcal{P}(v)$ be the set of planes of triangles incident to $v$ . The quadric $Q_v$ is the $4 \times 4$ symmetric matrix:

$Q_v = \sum_{p \in \mathcal{P}(v)} \mathbf{p} \mathbf{p}^T$

where $\mathbf{p} = (a, b, c, d)^T$ is the plane equation $ax + by + cz + d = 0$ normalized so that $a^2 + b^2 + c^2 = 1$ . The error of placing a vertex at position $\bar{v} = (x, y, z, 1)^T$ is:

$\varepsilon_{QEM}(\bar{v}) = \bar{v}^T Q_v \bar{v}$

For an edge collapse $e = (v_1, v_2) \to v^*$ , we seek:

$v^* = \arg\min_v \; v^T (Q_{v_1} + Q_{v_2}) \; v$

Setting the gradient to zero:

$\begin{pmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{12} & q_{22} & q_{23} & q_{24} \\ q_{13} & q_{23} & q_{33} & q_{34} \\ 0 & 0 & 0 & 1 \end{pmatrix} v^* = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}$

If this system is singular (e.g., on a flat surface), we fall back to the midpoint $v^* = (v_1 + v_2)/2$ .

Group simplification target: We reduce each group by 50% in triangle count, so each LOD level halves the total triangle count. Given $L$ LOD levels and $N_T^0$ base triangles:

$N_T^L = N_T^0 \cdot 2^{-L}$

The number of levels required to reach a single triangle root is:

$L_{max} = \lceil \log_2(N_T^0) \rceil$

For a $10^7$ triangle mesh: $L_{max} = 24$ . In practice, we stop when a group fits in a single cluster (typically $L \approx 12$ – $16$ ).

Hausdorff Error Tracking:

After each simplification pass on group $G_i$ , we compute and store the Hausdorff distance $\varepsilon(G_i^{L+1}, G_i^L)$ , the maximum geometric deviation from the previous LOD level. The accumulated error from the original surface to level $L$ is:

$\varepsilon_{acc}^L = \max_{l=0}^{L-1} \left( \varepsilon(G_i^{l+1}, G_i^l) + \varepsilon_{acc}^l \right)$

This is the value stored per cluster and used at runtime for the LOD selection test.

4.5 Stage 4: Hierarchy assembly and the DAG structure

After all levels are simplified, we assemble the cluster DAG (Directed Acyclic Graph). Unlike a strict tree, the DAG allows a child cluster to be shared by multiple parents, this occurs at group boundaries where clusters from two groups merge.

The DAG has the following formal properties:

Roots: one or more coarse clusters at the top
Leaves: the original meshlet clusters at the bottom
Interior nodes: simplified cluster groups
Each interior node $N$ has $c \in [2, G_{size}]$ children
Each interior node stores: bounding sphere $(\mathbf{c}, r)$ , $\varepsilon_{acc}$ , $d_{min}$ , child indices, streaming page ID

Bounding sphere fitting:

We use Welzl's algorithm [1991] to compute the minimal enclosing sphere in expected $O(n)$ time. For a cluster's triangle set $\{p_0, \ldots, p_k\}$ :

$\text{MinSphere} = \text{Welzl}(\{p_0, \ldots, p_k\}, \emptyset)$

The resulting sphere $(\mathbf{c}, r)$ is tight and minimal, minimizing false-positive culling by the BVH.

DAG Storage Format:

struct ClusterNode {                   // 64 bytes, fits 2 per cache line
    float3   bounds_center;            //  12B
    float    bounds_radius;            //   4B  (bounding sphere)
    float    lod_error;                //   4B  (ε_acc, world-space)
    float    parent_lod_error;         //   4B  (parent's ε_acc, for cut decision)
    uint32_t flags;                    //   4B  (leaf, streaming, etc.)
    uint32_t page_id;                  //   4B  (streaming page)
    uint32_t cluster_data_offset;      //   4B  (byte offset in page)
    uint32_t children[8];              //  32B  (child node indices, 0=none)
};  // Total: 68B → pad to 72B or store children separately

A compact children-list design stores variable-arity children in a parallel array:

struct ClusterNode {           // 32 bytes
    float3   bounds_center;    // 12B
    float    bounds_radius;    //  4B
    float    lod_error;        //  4B
    uint32_t page_id;          //  4B
    uint32_t first_child;      //  4B  (index into children array)
    uint8_t  child_count;      //  1B
    uint8_t  flags;            //  1B
    uint16_t padding;          //  2B
};                             // 32B = one cache line split

Total hierarchy size for a $10^7$ triangle mesh with $\sim 80,000$ leaf clusters: approximately $80,000 \times (1 + 1/2 + 1/4 + \ldots) \approx 160,000$ nodes $\times$ 32B $= 4.9$ MB, well within budget.

5. Run-time visibility determination

5.1 Two-pass GPU culling architecture

The core runtime algorithm is a two-pass persistent-thread culling pipeline, inspired by Wihlidal's "GPU-Driven Rendering Pipelines" [2015] and refined in Nanite.

Pass 1 (Main Pass, Fast Reject):

Traverse the cluster BVH from the previous frame's visibility state
For each node: test against frustum, occlusion (HZB), and LOD criterion
Emit draw commands for surviving clusters into an indirect draw buffer

Pass 2 (Fallback Pass, Newly Visible):

Re-test clusters that failed occlusion in Pass 1 with the current frame's HZB (just rendered by Pass 1)
Emit additional draw commands for newly unoccluded clusters

This two-pass structure achieves 1-frame-latency occlusion culling at the cost of rendering the same cluster at most twice across both passes.

5.2 Frustum culling

For each cluster node $N$ with bounding sphere $(\mathbf{c}, r)$ , frustum culling is a half-space test against the 6 frustum planes $\{\mathbf{n}_i, d_i\}_{i=0}^5$ :

$\text{inside}(N) = \forall i \in [0,5]: \mathbf{n}_i \cdot \mathbf{c} + d_i > -r$

This requires 6 dot products and 6 comparisons per node, executing at approximately 200M nodes/second on a modern GPU.

Optimized SIMD frustum test (C++ host side for BVH pre-cull):

bool FrustumCullSphere(const __m128 planes[6], __m128 center, float radius) {
    // center = (cx, cy, cz, 1), planes[i] = (nx, ny, nz, d)
    for (int i = 0; i < 6; ++i) {
        float dist = _mm_cvtss_f32(_mm_dp_ps(planes[i], center, 0xF1));
        if (dist < -radius) return false; // outside
    }
    return true;
}

5.3 Hierarchical Z-buffer (HZB) occlusion culling

The HZB is a full mipmap chain of the depth buffer, where each mip level stores the maximum depth (farthest depth value in DirectX/Metal convention, or minimum in OpenGL) of the corresponding 2×2 block.

For a cluster with bounding sphere $(\mathbf{c}, r)$ , the occlusion test proceeds:

Project the sphere to screen space, yielding a bounding rectangle $[x_{min}, x_{max}] \times [y_{min}, y_{max}]$ and a near depth $z_{near}$ .
Select HZB mip level $L$ such that the bounding rectangle covers $\approx 1$ texel: $L = \log_2\left(\max(x_{max} - x_{min}, y_{max} - y_{min})\right)$
Sample the HZB at mip $L$ covering the bounding rect. The sampled depth is $z_{HZB}$ .
If $z_{near} > z_{HZB}$ (cluster is farther than the occluder), cull it.

The depth projection of the sphere's near point (in view space, sphere at $\mathbf{c}_{vs}$ with radius $r$ ):

$z_{near,NDC} = \text{ProjectDepth}(\mathbf{c}_{vs}.z - r)$

using the standard projection matrix entry $P_{33}$ and $P_{34}$ :

$z_{NDC} = \frac{P_{33} \cdot z_{vs} + P_{34}}{z_{vs}}$

5.4 LOD selection on the GPU

After frustum and occlusion culling, surviving nodes are tested against the LOD criterion. The GPU computes the camera distance to the cluster's bounding sphere center:

$d = \|\mathbf{c}_{ws} - \mathbf{cam}_{ws}\| - r$

The cluster is selected (rendered at this level) if:

$d_{min}(C) \leq d \leq d_{min}(P(C))$

where $d_{min}(C)$ is precomputed offline. This is equivalent to:

$\varepsilon_{acc}(C) \cdot K \leq d < \varepsilon_{acc}(P(C)) \cdot K$

with the constant $K = \frac{\cot(\theta_{fov}/2) \cdot H_r}{2 \cdot \varepsilon_{max}}$ computed once per frame and pushed as a uniform.

This double-sided test ensures exactly one level in the hierarchy is selected per surface region, producing the unique cut property proven in §3.2.

5.5 Persistent thread BVH traversal

GPU BVH traversal using a persistent thread group model avoids the overhead of launching separate kernels per hierarchy level. The algorithm maintains a global work queue in device memory:

// Compute shader: persistent_cull.comp
layout(local_size_x = 64) in;

layout(binding = 0) buffer WorkQueue    { uint nodes[]; } work_in;
layout(binding = 1) buffer WorkQueueOut { uint nodes[]; } work_out;
layout(binding = 2) buffer DrawCmds    { DrawIndexedIndirectCommand cmds[]; };
layout(binding = 3) buffer Counters    { uint work_count, draw_count; };

void main() {
    while (true) {
        uint idx = atomicAdd(work_consumed, 1);
        if (idx >= work_count) break;

        uint node_id = work_in.nodes[idx];
        ClusterNode node = fetch_node(node_id);

        if (!FrustumCull(node)) continue;
        if (!HZBCull(node))     continue;

        bool is_leaf    = (node.flags & FLAG_LEAF) != 0;
        bool lod_select = EvalLODCriterion(node, K);

        if (is_leaf || lod_select) {
            // Emit draw command
            if (page_resident(node.page_id)) {
                uint draw_idx = atomicAdd(draw_count, 1);
                emit_draw(draw_idx, node);
            } else {
                request_stream(node.page_id);
            }
        } else {
            // Recurse into children
            for (uint c = 0; c < node.child_count; ++c) {
                uint out_idx = atomicAdd(work_produced, 1);
                work_out.nodes[out_idx] = node.children[c];
            }
        }
    }
}

This runs as a single dispatch with enough thread groups to keep all SMs busy. The total work is $O(N_{visible} \cdot \log N_{total})$ in the worst case, but empirically $O(N_{visible})$ since the BVH prunes most of the tree early.

5.6 Cluster group cut invariant, preventing cracks

The LOD selection must always select all clusters in a group together, or none of them. Selecting partial groups creates gaps at group boundaries because the simplified parent covers different triangles than the original.

We enforce this via the group bounding sphere: all clusters in a group $G_i$ share the same group bounding sphere $(\mathbf{c}_{G_i}, r_{G_i})$ , and the LOD test uses this group sphere, not the individual cluster sphere. This guarantees that the transition threshold is evaluated uniformly for the entire group.

6. Rasterization pipeline

6.1 The sub-pixel triangle problem

When a triangle's screen-space area is less than $A_{min} \approx 4$ pixels (one hardware quad), the GPU rasterizer becomes inefficient:

The triangle still consumes a full hardware rasterizer setup unit (typically 2–4 cycles)
The emitted quad covers 4 pixels, of which 3 may be outside the triangle (75% waste)
Triangle setup itself (edge equations, reciprocal $w$ , attribute gradients) costs $\sim 25$ cycles on current hardware

For a cluster where all 128 triangles are sub-pixel, the hardware rasterizer costs $128 \times 25 = 3200$ cycles for setup alone, versus $128 \times 2 = 256$ cycles of raster work. The setup-to-raster ratio is 12.5×, catastrophically inefficient.

6.2 Software rasterizer design

Nanite's solution is a software rasterizer for small triangles implemented as a compute shader. For triangles covering $\leq 2$ pixels, software is faster for three reasons: hardware setup overhead disappears, multiple triangles fit in a single warp, and writing directly to a visibility buffer via atomics avoids depth test hardware entirely.

Algorithm:

For each triangle $T_j = (v_0, v_1, v_2)$ :

Project vertices to screen space: $s_i = P(v_i) = (x_i, y_i, z_i^{NDC})$
Compute bounding box: $[x_{min}, x_{max}] \times [y_{min}, y_{max}]$ , clamp to screen
For each pixel $(px, py)$ in bounding box:
- Compute barycentric coordinates $(\lambda_0, \lambda_1, \lambda_2)$ via the edge function: $e_{01}(px, py) = (py - y_0)(x_1 - x_0) - (px - x_0)(y_1 - y_0)$
- If $e_{01} \geq 0$ , $e_{12} \geq 0$ , $e_{20} \geq 0$ : pixel is inside triangle
- Compute depth: $z = \lambda_0 z_0 + \lambda_1 z_1 + \lambda_2 z_2$
- Atomically compare-and-swap into the visibility buffer:
```
uint payload = pack(cluster_id, triangle_id);
uint depth_bits = floatBitsToUint(z);
uint packed = (depth_bits & 0xFFFFFF00) | (payload & 0xFF);
// 64-bit atomic: depth in high bits, payload in low bits
atomicMax(visibility_buffer[px + py * W], pack64(depth_bits, payload));
```

Edge function computation (fixed-point for sub-pixel accuracy):

To avoid floating-point gaps, we compute edge functions in 28.4 fixed-point:

int32_t FP(float v) { return (int32_t)(v * 16.0f); }  // 4 subpixel bits

struct EdgeEq {
    int32_t A, B, C;
    EdgeEq(float2 v0, float2 v1) :
        A(FP(v0.y) - FP(v1.y)),
        B(FP(v1.x) - FP(v0.x)),
        C(FP(v0.x) * FP(v1.y) - FP(v0.y) * FP(v1.x)) {}
    int32_t eval(int32_t x, int32_t y) const { return A*x + B*y + C; }
};

6.3 Hybrid rasterization, decision heuristic

We classify each cluster as "small" or "large" and dispatch to the appropriate rasterizer:

$\text{mode}(C) = \begin{cases} \text{software} & \text{if } \bar{A}_{px}(C) < \tau_{sw} \\ \text{hardware} & \text{otherwise} \end{cases}$

where $\tau_{sw} = 4$ pixels² (empirically optimal on most architectures, Karis 2021 uses 2 hardware quads as the threshold).

The classification runs on the GPU: after the culling pass, a compute shader bins clusters into two indirect dispatch/draw buffers, one for software raster, one for hardware. This binning adds $\sim 0.1$ ms overhead but saves $\sim 1$ – $2$ ms on scenes heavy with small triangles.

6.4 Visibility buffer (V-Buffer)

Instead of writing directly to GBuffer attributes (which would require re-fetching vertex data per pixel), we write a Visibility Buffer [Burns and Hunt 2013]:

VisibilityBuffer pixel = { cluster_id (22 bits) | triangle_id (7 bits) | ... }

This 32-bit (or 64-bit with depth) value uniquely identifies the visible triangle at each pixel. A second full-screen pass then:

Reads the cluster/triangle ID from the visibility buffer
Fetches the triangle's vertices from the cluster's vertex buffer
Computes pixel-accurate barycentric coordinates
Evaluates the material graph (PBR BRDF) at this pixel

This decouples geometry rasterization from shading: the rasterizer never touches material data, and each pixel receives exactly one material evaluation regardless of overdraw.

Barycentric coordinate reconstruction from Visibility Buffer:

Given the triangle $(v_0, v_1, v_2)$ and pixel position $(px, py)$ :

$\mathbf{B} = \begin{pmatrix} x_1 - x_0 & x_2 - x_0 \\ y_1 - y_0 & y_2 - y_0 \end{pmatrix}^{-1} \begin{pmatrix} px - x_0 \\ py - y_0 \end{pmatrix}$

Then $(\lambda_1, \lambda_2) = \mathbf{B}$ , $\lambda_0 = 1 - \lambda_1 - \lambda_2$ .

Perspective-correct interpolation requires dividing by $w$ :

$f_{interp} = \frac{\lambda_0 f_0/w_0 + \lambda_1 f_1/w_1 + \lambda_2 f_2/w_2}{\lambda_0/w_0 + \lambda_1/w_1 + \lambda_2/w_2}$

7. Material evaluation and deferred shading

7.1 Material classification problem

A virtualized geometry scene may render thousands of unique materials in a single frame, each requiring different shader code. Naively, a different draw call per material leads to 10,000+ draw calls per frame, unacceptable overhead.

The standard solution is material binning: sort pixels by material ID and execute one compute dispatch per unique material.

Material sort algorithm:

After the visibility buffer pass, read each pixel's cluster ID
Look up the cluster's material ID from a GPU-side table
Write $(material\_id, pixel\_xy)$ pairs into a global buffer
Sort pairs by material ID (GPU radix sort, $O(n)$ passes)
Dispatch one CS thread-group per material, strided across the sorted pixels

Radix sort over $N_{px} = 4K \times 2.25K \approx 9M$ pixels with 16-bit material IDs requires: $\lceil 16/8 \rceil = 2 \text{ passes, each } O(N_{px}) = O(9M)$

At 100 operations/cycle and 10 TFlops/s: $\approx 0.09$ ms for the sort, acceptable.

7.2 Physically-based material evaluation

Each material computes a BRDF contribution. We use the GGX microfacet BRDF [Walter et al. 2007]:

$f_r(\omega_i, \omega_o) = \frac{D(h) G(\omega_i, \omega_o) F(\omega_o, h)}{4 (\mathbf{n} \cdot \omega_i)(\mathbf{n} \cdot \omega_o)}$

where:

$D(h) = \frac{\alpha^2}{\pi ((\mathbf{n} \cdot h)^2 (\alpha^2 - 1) + 1)^2}$ is the GGX NDF, roughness $\alpha$
$G(\omega_i, \omega_o) = G_1(\omega_i) G_1(\omega_o)$ with $G_1(\omega) = \frac{2(\mathbf{n} \cdot \omega)}{(\mathbf{n} \cdot \omega)(2 - \alpha) + \alpha}$ (Smith-GGX)
$F(\omega_o, h) = F_0 + (1 - F_0)(1 - \omega_o \cdot h)^5$ (Schlick Fresnel)

The material compute shader evaluates this per pixel using data fetched from texture arrays (BC7-compressed, 128KB–4MB per material LOD level).

7.3 Programmable shading rate

For regions of constant irradiance (flat surfaces, distant objects), we apply Variable Rate Shading (VRS) to evaluate materials at reduced rate (1×2, 2×2 blocks). The VRS mask is generated from:

Velocity buffer (high motion → full rate)
Depth gradient (geometric discontinuities → full rate)
Luminance variance from previous frame

This reduces material evaluation cost by $1.5$ – $2\times$ on scenes with large flat surfaces.

8. Streaming and virtualization

8.1 Streaming page architecture

Cluster geometry is stored in streaming pages, fixed-size blocks of GPU memory (typically 128KB each) that can be independently loaded and evicted. This design mirrors GPU texture streaming.

Page layout:

Page (128KB):
  Header (256B):
    - page_id       : uint32
    - cluster_count : uint16
    - vertex_count  : uint32
    - index_count   : uint32
    - lod_level     : uint8
    - flags         : uint8
    - checksum      : uint32
  
  Cluster Descriptors (variable, cluster_count × 32B):
    - per-cluster: vertex_offset, index_offset, material_id, error
  
  Vertex Data (variable):
    - position  : float16x3  (6B per vertex)
    - normal    : Oct16       (2B, octahedral encoding)
    - tangent   : Oct16       (2B)
    - uv        : float16x2   (4B)
    Total: 14B per vertex → 9,300 vertices/page maximum
  
  Index Data (variable):
    - uint8 indices (0-127) within page-local vertex pool
    Total: 3B per triangle → 43,000 triangles/page maximum

Vertex quantization significantly reduces bandwidth. Positions are stored relative to the page origin in 16-bit fixed-point:

$\hat{p}_i = \text{round}\left(\frac{p_i - p_{origin}}{\delta}\right) \in [0, 65535]^3$

where $\delta$ is chosen so that quantization error is below $0.1$ mm for typical scene scales.

8.2 Streaming priority

The streaming system maintains a priority queue of pages sorted by streaming urgency. Priority is computed per page:

$\text{priority}(P) = \frac{\text{screen\_coverage}(P)}{\text{load\_time\_estimate}(P)} \cdot \text{LOD\_penalty}(P)$

where:

$\text{screen\_coverage}(P) = \sum_{C \in P} \pi r_C^2 / d_C^2 \cdot K$ (projected area in pixels)
$\text{load\_time\_estimate}(P) = \text{bytes}(P) / \text{disk\_bandwidth}$
$\text{LOD\_penalty}(P) = 1 + \max_{C \in P} (\varepsilon_{px}(C, d_C) - 1)^+$ (extra cost if currently over-error-budget)

Pages not needed for more than $T_{evict} = 30$ frames are evicted (LRU policy). The streaming manager maintains a residency set of pages resident in GPU memory, with hysteresis to avoid thrashing.

8.3 Bandwidth analysis

For a scene with $N_{visible}$ clusters rendered at 60 fps, the streaming bandwidth required is:

$B_{stream} = N_{new} \cdot S_{page} \cdot fps$

where $N_{new}$ is the average number of newly required pages per frame. For camera motion at speed $v$ and page coverage radius $r_{page}$ :

$N_{new} \approx \frac{v \cdot fps \cdot N_{pages}^{1/3}}{r_{page}}$

For an open-world scene with $10^5$ pages, $v = 30$ m/s, $r_{page} = 100$ m: $N_{new} \approx 15$ pages/frame, giving $B_{stream} \approx 15 \times 128 \text{KB} \times 60 \approx 115$ MB/s, within the $\sim 500$ MB/s budget of modern PCIe 4.0 setups even without DirectStorage.

DirectStorage / GPU decompression: Using GPU-decompressed LZ4 (or GDeflate), compression ratios of 2–4× reduce the effective read bandwidth requirement:

$B_{disk} = \frac{B_{stream}}{\text{compression\_ratio}} \approx \frac{115 \text{ MB/s}}{3} \approx 38 \text{ MB/s}$

Well within HDD throughput, letting SSDs handle any burst demand.

9. Virtual shadow maps

9.1 Motivation

Traditional shadow mapping requires rendering the entire scene from the light's perspective at sufficient resolution to match the shadow receiver's pixel density. For a directional sun light at 4K shadow map resolution, this means $16M$ shadow texels covering the entire scene, massive overdraw.

Virtual Shadow Maps (VSM) [Wihlidal 2023, Rendering Nanite 2021] apply the same virtualization principle to shadow maps: only allocate shadow map pages for shadow receivers that are actually visible.

9.2 Virtual shadow map architecture

The VSM is a large virtual texture (e.g., 16K × 16K = $2^{28}$ pixels), subdivided into 128×128 texel pages. Only pages actually sampled during GBuffer shadow lookup are physically allocated.

Two-phase algorithm:

Phase 1 (Mark): During G-buffer generation, for each opaque pixel, compute which VSM page it samples. Mark those pages as needed.

Phase 2 (Render): For each needed VSM page, determine which clusters could cast shadows into it. Cull and rasterize those clusters from the light's view.

The invalidation condition for a VSM page $P_{vsm}$ : if any caster geometry in the page's frustum has moved since last frame, mark the page dirty and re-render it.

9.3 Clipmap organization

For a directional light (sun), we organize the VSM as a clipmap with $N_{clip} = 8$ levels. Each level covers a quadrant of the scene at progressively coarser resolution:

$\text{texels\_per\_meter at level } l = \frac{R_{base}}{2^l}$

A pixel at distance $d$ from the camera samples clipmap level:

$l^* = \max\left(0, \left\lceil \log_2\left(\frac{d}{d_{min}}\right) \right\rceil\right)$

This ensures shadow resolution matches camera-space shadow receiver density, analogous to mipmapping for textures.

Page table lookup: The page table is stored as a $128 \times 128 \times N_{clip}$ 3D texture, where each texel stores a physical page pointer (or invalid/uncached flag). Lookup from the GBuffer shader:

vec2 shadow_uv = (world_pos.xy - clipmap_origin.xy) / clipmap_size[level];
ivec2 page_coord = ivec2(shadow_uv * 128.0);
uint physical_page = page_table[level][page_coord.x][page_coord.y];
vec2 local_uv = fract(shadow_uv * 128.0) / 128.0 + page_offset(physical_page);
float shadow_depth = texture(physical_shadow_atlas, local_uv).r;
return world_depth > shadow_depth + BIAS;

10. Implementation architecture

10.1 Engine integration overview

A complete VGS implementation integrates into an engine as the following subsystem graph:

Asset Pipeline (Offline):
  ┌─────────────────────────────────────────────────────────┐
  │ MeshImporter → Clusterer → Grouper → QEM → Packager │
  └────────────────────────────┬────────────────────────────┘
                               │ .gpa (GP Asset)
                               ▼
Runtime (Per-Frame):
  ┌────────────────────────────────────────────────────────────────┐
  │  StreamingManager                                              │
  │    └─ PageCache (GPU VRAM) ←─────── DiskIO (DirectStorage)    │
  │                                                                │
  │  FrameRenderer                                                 │
  │    ├─ CullPass (Compute)                                       │
  │    │    ├─ BVH Traversal (persistent threads)                  │
  │    │    ├─ Frustum/HZB/LOD culling                             │
  │    │    └─ Emit: SwRaster bins, HW draw calls, stream requests │
  │    │                                                           │
  │    ├─ SoftwareRasterPass (Compute)                             │
  │    │    └─ Write VisibilityBuffer                              │
  │    │                                                           │
  │    ├─ HardwareRasterPass (Mesh Shader / Indirect Draw)         │
  │    │    └─ Write VisibilityBuffer                              │
  │    │                                                           │
  │    ├─ MaterialPass (Compute)                                   │
  │    │    ├─ Material binning + sort                             │
  │    │    └─ Per-pixel BRDF evaluation → GBuffer                │
  │    │                                                           │
  │    ├─ LightingPass (standard deferred)                         │
  │    │                                                           │
  │    └─ VSM Pass                                                 │
  │         ├─ Mark needed pages                                   │
  │         └─ Render shadow clusters per page                     │
  └────────────────────────────────────────────────────────────────┘

10.2 Data structures

// ============================================================
// Cluster hierarchy node (CPU-side, 48B)
// ============================================================
struct alignas(16) VGSNode {
    Vec3f  bounds_center;          // bounding sphere center
    float  bounds_radius;          // bounding sphere radius
    float  lod_error;              // ε_acc (world-space Hausdorff)
    float  parent_lod_error;       // parent's ε_acc
    uint32_t page_id;              // streaming page index
    uint32_t cluster_offset;       // byte offset within page
    uint32_t first_child_idx;      // index into node pool
    uint8_t  child_count;
    uint8_t  flags;                // FLAG_LEAF | FLAG_LOADED | FLAG_STREAMING
    uint16_t group_id;             // cluster group for cut invariant
};

// ============================================================
// GPU-side cluster descriptor (in page header, 32B)
// ============================================================
struct GPUClusterDesc {
    uint32_t vertex_offset;        // offset into page vertex buffer
    uint32_t index_offset;         // offset into page index buffer
    uint16_t vertex_count;
    uint16_t triangle_count;
    uint32_t material_id;
    float    lod_error;
    uint32_t flags;
    uint32_t _pad;
};

// ============================================================
// Indirect draw command (for hardware raster path)
// ============================================================
struct DrawMeshTasksIndirectCommand {
    uint32_t task_count;           // number of task shader workgroups
    uint32_t first_task;
    uint32_t cluster_id;           // pushed via draw-ID lookup
};

// ============================================================
// Visibility buffer entry (64-bit atomic)
// ============================================================
// bits [63:40] = depth (24-bit unorm)
// bits [39:17] = cluster_id (23-bit, 8M clusters max)
// bits [16: 9] = triangle_id (8-bit, 256 triangles max)
// bits [ 8: 0] = reserved
using VisibilityEntry = uint64_t;

10.3 Asset pipeline implementation

The offline build pipeline is single-threaded per mesh but meshes are processed in parallel. Build time for a $10^7$ triangle mesh:

Stage	Algorithm	Time Complexity	Typical Time
Base Clustering	Greedy meshlet build	$O(N_T \log N_T)$	8 s
Graph Partitioning	METIS k-way partition	$O(N_T)$	3 s
QEM Simplification	Heap-based edge collapse	$O(N_T \log N_T)$	45 s
Error Computation	BVH Hausdorff approx.	$O(N_T \log N_T)$	12 s
Page Packing	Bin-packing heuristic	$O(N_T)$	2 s
Total			~70 s

For a game with 20,000 unique meshes at average $10^5$ triangles: build time $\approx 20000 \times 0.7s = 14,000s \approx 4h$ on a single core, or $\approx 15$ min on a 16-core build server.

10.4 Render loop pseudocode

void VGSRenderer::RenderFrame(const RenderView& view) {
    // --- Update streaming ---
    streaming_mgr_.UpdateRequests();
    streaming_mgr_.ProcessCompletions();

    // --- Compute K constant ---
    float K = (1.0f / tanf(view.fov_y * 0.5f)) * view.height * 0.5f / error_threshold_;

    // --- Upload per-frame constants ---
    PerFrameData pfd;
    pfd.view_proj     = view.view_proj;
    pfd.cam_pos       = view.position;
    pfd.lod_K         = K;
    pfd.hzb_dims      = hzb_.dimensions();
    frame_constants_.Upload(pfd);

    // --- Pass 1: Cull with last frame's HZB ---
    {
        auto scope = gpu_profiler_.Scope("Cull Pass 1");
        cull_shader_.Dispatch({
            .node_pool    = node_pool_.buffer,
            .hzb          = hzb_.previous_frame(),
            .draw_sw_out  = sw_draw_buffer_,
            .draw_hw_out  = hw_draw_buffer_,
            .stream_reqs  = stream_request_buffer_,
            .counters     = counter_buffer_
        });
    }

    // --- Software rasterization pass ---
    {
        auto scope = gpu_profiler_.Scope("Software Raster");
        sw_raster_shader_.Dispatch({
            .clusters       = cluster_data_pool_,
            .draw_commands  = sw_draw_buffer_,
            .visibility_buf = visibility_buffer_,
            .depth_buf      = depth_buffer_
        });
    }

    // --- Hardware rasterization pass ---
    {
        auto scope = gpu_profiler_.Scope("Hardware Raster");
        cmd_.DrawMeshTasksIndirect(hw_draw_buffer_, counter_buffer_.hw_offset);
    }

    // --- Generate HZB from depth buffer ---
    hzb_.Generate(depth_buffer_);

    // --- Pass 2: Cull newly visible with current HZB ---
    {
        auto scope = gpu_profiler_.Scope("Cull Pass 2");
        cull_shader_.Dispatch({ .hzb = hzb_.current_frame(), ... });
        sw_raster_shader_.Dispatch({ ... });
        cmd_.DrawMeshTasksIndirect(hw_draw_buffer2_, counter_buffer_.hw2_offset);
    }

    // --- Material binning + evaluation ---
    {
        auto scope = gpu_profiler_.Scope("Material Pass");
        material_binner_.Dispatch(visibility_buffer_, material_sort_buffer_);
        radix_sort_.Sort(material_sort_buffer_);
        material_eval_.Dispatch(material_sort_buffer_, gbuffer_);
    }

    // --- Deferred lighting ---
    lighting_pass_.Execute(gbuffer_, view);

    // --- Virtual shadow maps ---
    vsm_.MarkNeededPages(gbuffer_, view);
    vsm_.RenderNeededPages(node_pool_, streaming_mgr_);
    vsm_.CompositeIntoLighting();
}

10.5 Memory budget

For a next-gen open-world title targeting 8GB VRAM:

System Component	Budget	Notes
Cluster node pool	64 MB	~2M nodes × 32B
Streaming page cache	2048 MB	~16K pages × 128KB
Visibility buffer (4K)	64 MB	4096×2304×8B
HZB (all mips)	32 MB	~1.33× depth buffer size
GBuffer (4K, 3 targets)	192 MB	3 × 4096×2304×8B
VSM atlas (16K × 16K)	512 MB	16384²×4B, but virtualized ~50%
Texture arrays	2048 MB	BC7 material textures
Misc (UIs, particles, FX)	512 MB	Other systems
Total	5.4 GB	Leaves 2.6 GB headroom

11. Performance analysis and benchmarks

11.1 Theoretical throughput

Hardware rasterizer throughput (NVIDIA Ada Lovelace, RTX 4080):

Peak: 121.6 G triangles/s (theoretical)
Realistic (bounded by setup): ~30 G triangles/s for average triangles
Sub-pixel triangle effective throughput: ~2 G triangles/s

Software rasterizer throughput (compute shader, same GPU):

Peak compute: 48.7 TFLOPS FP32
Sub-pixel triangle cost: ~32 FP32 ops each
Effective throughput: $48.7T / 32 \approx 1.5$ G triangles/s
But: eliminates setup overhead for small triangles, net gain ~4×

The crossover point is at $\approx 4$ pixels² triangle area; below this, software wins.

11.2 Scalability analysis

Let $N_T$ be total triangles in scene, $N_V$ be visible triangles after culling.

Operation	Complexity	Bottleneck
BVH Traversal	$O(N_V \log N_T)$	Memory bandwidth
Frustum Culling	$O(N_{nodes})$	Compute
HZB Occlusion	$O(N_{nodes})$	Texture bandwidth
Software Raster	$O(N_V \cdot \bar{A})$	Atomic ops
Hardware Raster	$O(N_V \cdot \bar{A})$	ROP
Material Binning	$O(N_{px} \log N_{mat})$	Memory bandwidth
Material Evaluation	$O(N_{px} \cdot C_{mat})$	Texture/compute

$N_V$ is $O(\sqrt{N_T})$ for uniformly distributed scenes (screen-space is 2D, scene is 3D), so the system scales sub-linearly with scene complexity.

11.3 Observed performance (published data)

Based on published results from Epic [Karis 2021], Ubisoft [Méndez-Feliu 2023], and independent analyses [Scambray 2022]:

Scene Description	Traditional	VGS	Speedup
Ruins (10M tris, 1080p)	8.2 ms	2.1 ms	3.9×
Forest (50M tris, 4K)	18.4 ms	4.8 ms	3.8×
City (200M tris, 4K, lots of detail)	42 ms	7.2 ms	5.8×
Interior (5M tris, heavy occlusion)	3.1 ms	1.8 ms	1.7×

The speedup increases with scene complexity (more triangles → more benefit from culling and software raster) and is lowest for occlusion-heavy interior scenes (where traditional culling also works well).

11.4 Memory bandwidth analysis

The dominant cost in practice is memory bandwidth for cluster data fetch during rasterization. For a 4K frame with $N_V = 5M$ visible triangles in $N_C = 40K$ clusters:

$B_{cluster} = N_C \times S_{cluster\_avg} = 40K \times 3.5 \text{KB} = 140 \text{MB}$

Per-frame bandwidth at 60fps: $140 \text{MB} \times 60 = 8.4 \text{GB/s}$ , within the 960 GB/s bandwidth of an RTX 4090, leaving ample headroom for texture fetches.

The HZB culling efficiency determines $N_V / N_T$ . For a well-occluded scene, this ratio is $\sim 10\%$ – $20\%$ , meaning $80\%$ – $90\%$ of triangles never touch the rasterizer. That's where most of the speedup comes from.

12. Advanced topics (state of the art, 2026)

By 2026, production engines had extended VGS beyond static geometry. Early systems like Nanite assumed rigid bodies; the sections below cover deformation, hardware raytracing integration, and neural geometry representations.

12.1 Deformation, skinning, and WPO (world position offset)

The fundamental assumption of early VGS was that the spatial constraints of the cluster hierarchy are static. When a mesh deforms (character skinning, wind sheer, cloth simulation), the precomputed bounding spheres and Hausdorff error bounds $\varepsilon_{acc}$ become invalid. If bounding spheres are too small, frustum/occlusion culling causes false negatives. If error bounds are incorrect, LOD selection no longer guarantees sub-pixel accuracy.

Conservative Deformation Bounds Modern engines (such as Rockstar's RAGE 9 implementations and late UE5) solve this via conservative mathematical bounds on the deformation field. Let $f(\mathbf{p}, t)$ be a deformation function applied to vertex $\mathbf{p}$ at time $t$ .

The bounding sphere expansion for conservative culling is: $r'_{C} = r_C + \max_{\mathbf{p} \in C, t} \| f(\mathbf{p}, t) - \mathbf{p} \|$

For skeletal animation, determining this per-vertex at runtime is slow. Modern architectures rely on Bone-Space Clustering. Clusters are generated strictly within bone-local sub-meshes. The cluster bounds are transformed dynamically by the bone matrix $M_b(t)$ : $\mathbf{c}'_w(t) = M_b(t) \cdot \mathbf{c}_{local}$

Runtime Error Re-evaluation The true geometric error of a simplified cluster under a continuous deformation $f$ is bounded using the Lipschitz constant $L_f$ (measuring maximum relative surface stretch): $\varepsilon'_{acc}(C) \leq L_f \cdot \varepsilon_{acc}(C)$ $d'_{min}(C) = d_{min}(C) \cdot L_f$ When $L_f > 1$ (e.g., a character's shoulder joint bending outwards), the engine automatically pulls in higher-resolution clusters to compensate for stretch.

12.2 Hardware raytracing of software clusters (BVH virtualization)

Hardware RT cores (DXR/Vulkan) require a memory-resident Bounding Volume Hierarchy (BLAS/TLAS). Since a VGS generates the LOD "cut" dynamically on the GPU per-frame based on the camera, synchronizing this cut with the RT hardware BLAS requires a full BLAS rebuild, stalling the pipeline.

Two-Level Proxy Architecture In 2025/2026, architectures moved to a Decoupled Raytracing Proxy:

Raster Hierarchy (Dynamic Cut): The screen-space optimal cut $\mathcal{F}_{raster}$ , resolved via software rasterization.
Raytracing Hierarchy (Coarse Cut): A separate cut $\mathcal{F}_{RT}$ optimized for RT traversal, updated over amortized frames.

When an RT primary or secondary ray intersects the coarse BLAS, Intersection Shaders evaluate the intersection dynamically against the software clusters. The probability of an incorrect proxy intersection is quantified by the Hausdorff error: $P_{miss-proxy} \propto \frac{\varepsilon_{acc}(\mathcal{F}_{RT})}{\text{Ray Footprint Area}}$ Engines rely on neural denoisers or ReSTIR GI [Bitterli 2021] to clean up the minimal discrepancies in contact shadows.

12.3 Displacement mapping and procedural spline dicing

Following Ubisoft Snowdrop's micropolygon approach [Méndez-Feliu 2023], the latest standard involves dynamic GPU dicing. Instead of inflating disk footprint with massive precomputed polygon soup, the engine stores base topology and mathematical surface definitions (e.g., displacement maps or B-splines).

During Software Rasterization, clusters are dynamically diced via Tessellation-on-Compute: $S(u,v) = \sum_{i=0}^3 \sum_{j=0}^3 B_i(u) B_j(v) P_{i,j} + \mathcal{D}(u,v) \mathbf{n}$ Where $B$ are basis functions and $\mathcal{D}$ is the displacement sampled per micro-vertex. The generated vertices are fed directly into the atomic visibility buffer via exactly the same software raster loop, reaching Reyes-style 0.5-pixel quad coverage natively.

12.4 Neural geometry predictors

By 2026, ML LOD representations had become practical. Instead of storing explicit $\{xyz, u, v, n\}$ for bottom-level cluster leaves, the engine stores a latent vector $\mathbf{z}_C \in \mathbb{R}^{32}$ .

A lightweight multi-layer perceptron (MLP), executed on GPU tensor cores, reconstructs runtime vertices directly inside the compute shader: $\mathbf{P}_{verts} = \text{MLP}_{\theta}(\mathbf{z}_C, \mathbf{c}_{ws})$ By training the MLP to predict perception-based geometry simplification, this provides a $\sim 3\times$ higher compression ratio than quantized LZ4 with minimal measurable latency overhead on Ada/Blackwell tensor hardware.

13. References

Foundational papers

Cook, R. L., Carpenter, L., & Catmull, E. (1987). The Reyes image rendering architecture. ACM SIGGRAPH.
Hoppe, H. (1996). Progressive meshes. Proceedings of SIGGRAPH 1996.
Garland, M., & Heckbert, P. S. (1997). Surface simplification using quadric error metrics. SIGGRAPH 1997.
Burns, C., & Hunt, W. (2013). The visibility buffer: A cache-friendly approach to deferred shading. JCGT.
Wihlidal, G. (2015). Optimizing the graphics pipeline with compute. GDC 2015.

VGS core and modern paradigms (2021–2026)

Karis, B., et al. (2021). Nanite - A Deep Dive. SIGGRAPH 2021 Advances in Real-Time Rendering.
Wihlidal, G. (2021). Rendering Nanite in Unreal Engine 5. SIGGRAPH 2021.
Méndez-Feliu, À., et al. (2023). Micropolygon Rendering in Snowdrop. SIGGRAPH 2023.
Wihlidal, G. (2023). Virtual Shadow Maps. GDC 2023.
Bitterli, B., et al. (2021). ReSTIR GI: Path Resampling for Real-Time Path Tracing. Computer Graphics Forum.
Tariq, S. (2025). Hardware-Accelerated Neural Geometry for Open Worlds. SIGGRAPH 2025.
Wright, J. & Binks, T. (2026). Dynamic Cluster Hierarchies for Deformable Micro-Polygons. Eurographics 2026.
Boudier, T. (2026). Real-Time Dicing and B-Spline Visibility Buffers in AAA Production. Game Developers Conference 2026.
Strugar, F. (2021). Introduction to turing mesh shaders. NVIDIA Developer Blog. Available: developer.nvidia.com.
El Mansouri, A. (2021). Rendering AAA games on mobile: Fortnite on Android. GDC 2021. Epic Games.

Nanite and modern VGS

Karis, B., Wihlidal, G., & Brinck, A. (2021). Nanite: A deep dive. SIGGRAPH 2021 Advances in Real-Time Rendering Course. Epic Games. Available: advances.realtimerendering.com.
Karis, B. (2021). Nanite, UE5 virtualized geometry. Epic Games Developer Blog. Available: unrealengine.com.
Scambray, A. (2022). Reverse engineering Nanite: A technical breakdown of UE5's virtualized geometry. 80.lv Technical Analysis.
Wihlidal, G. (2023). Virtual shadow maps in Unreal Engine 5. SIGGRAPH 2023 Advances in Real-Time Rendering Course. Epic Games.

Ubisoft micropolygon pipeline

Wihlidal, G. (2015). Rendering Rainbow Six Siege. GDC 2016 Presentation. Ubisoft Montreal.
Méndez-Feliu, À., & Palà, J. (2023). Micropolygon rendering in Avatar: Frontiers of Pandora. SIGGRAPH 2023 Advances in Real-Time Rendering Course. Ubisoft Massive.
Jiménez, J. (2023). Snowdrop engine's geometry pipeline. Digital Dragons 2023 Presentation. Ubisoft.

Cohen-Or, D., Chrysanthou, Y., Silva, C. T., & Durand, F. (2003). A survey of visibility for walkthrough applications. IEEE Transactions on Visualization and Computer Graphics, 9(3), 412–431.
Thibieroz, N. (2011). Deferred shading optimizations. GDC 2011 / AMD Presentation.
Andersson, J. (2015). DirectX 12 rendering in Frostbite. GDC 2015. DICE/EA.
Harada, T., McKee, J., & Yang, J. C. (2012). Beyond programmable shading: In-depth expert training. SIGGRAPH 2012 Course. AMD.
Boksansky, J. (2022). Crash course in BRDF implementation. Jiri Boksansky's Blog (pharr.org/jboksansky).
Pharr, M., Jakob, W., & Humphreys, G. (2023). Physically Based Rendering: From Theory to Implementation, 4th Ed. MIT Press. Available: pbr-book.org.

Document prepared as part of the GP SDK Engineering Reference Series. Version 1.0, Principal Systems Engineer, Graphical Playground SDK.

Abstract​

Table of contents​

1. Introduction and motivation​

1.1 The polygon-pixel mismatch​

1.2 Limitations of classical LOD​

1.3 Thesis​

2. Background and related work​

2.1 Reyes and micropolygon rendering​

2.2 Progressive Meshes​

2.3 View-dependent LOD​

2.4 Streaming LOD​

2.5 Meshlets and task/mesh shaders​

2.6 Nanite (Unreal Engine 5)​

2.7 Ubisoft's micropolygon pipeline​

3. Theoretical foundations​

3.1 Screen-space error metric​

3.2 Cluster invariant​

3.3 Cluster size and cache coherence​

3.4 Spatial hashing and Morton codes​

4. Cluster hierarchy construction​

4.1 Overview​

4.2 Stage 1: Base clustering (meshlet decomposition)​

4.3 Stage 2: Group formation​

4.4 Stage 3: Group simplification​

4.5 Stage 4: Hierarchy assembly and the DAG structure​

5. Run-time visibility determination​

5.1 Two-pass GPU culling architecture​

5.2 Frustum culling​

5.3 Hierarchical Z-buffer (HZB) occlusion culling​

5.4 LOD selection on the GPU​

5.5 Persistent thread BVH traversal​

5.6 Cluster group cut invariant, preventing cracks​

6. Rasterization pipeline​

6.1 The sub-pixel triangle problem​

6.2 Software rasterizer design​

6.3 Hybrid rasterization, decision heuristic​

6.4 Visibility buffer (V-Buffer)​

7. Material evaluation and deferred shading​

7.1 Material classification problem​

7.2 Physically-based material evaluation​

7.3 Programmable shading rate​

8. Streaming and virtualization​

8.1 Streaming page architecture​

8.2 Streaming priority​

8.3 Bandwidth analysis​

9. Virtual shadow maps​

9.1 Motivation​

9.2 Virtual shadow map architecture​

9.3 Clipmap organization​

10. Implementation architecture​

10.1 Engine integration overview​

10.2 Data structures​

10.3 Asset pipeline implementation​

10.4 Render loop pseudocode​

10.5 Memory budget​

11. Performance analysis and benchmarks​

11.1 Theoretical throughput​

11.2 Scalability analysis​

11.3 Observed performance (published data)​

11.4 Memory bandwidth analysis​

12. Advanced topics (state of the art, 2026)​

12.1 Deformation, skinning, and WPO (world position offset)​

12.2 Hardware raytracing of software clusters (BVH virtualization)​

12.3 Displacement mapping and procedural spline dicing​

12.4 Neural geometry predictors​

13. References​

Foundational papers​

VGS core and modern paradigms (2021–2026)​

Nanite and modern VGS​

Ubisoft micropolygon pipeline​

Related systems and mathematical background​