Skip to main content

Virtualized Geometry Systems: Architecture, Theory, and Implementation

· 41 min read
Mallory Scotton
Founder of Graphical Playground

A Principal Engineer's Guide to Production-Grade Micro-Polygon Rendering


"The geometry problem is not one of bandwidth alone, it is a problem of indirection, coherence, and the fundamental mismatch between the triangle and the pixel."


Abstract

Modern game engines face an intractable triangle throughput ceiling: GPU rasterizers are designed around polygons covering several pixels, yet cinematic assets routinely produce triangles smaller than a single pixel at runtime distances. Classical Level-of-Detail (LOD) pipelines fail at this scale due to popping artifacts, authoring overhead, and the inability to leverage the full spatial resolution of sculpted meshes. This paper covers Virtualized Geometry Systems (VGS), the family of techniques that includes Epic Games' Nanite (Unreal Engine 5), Ubisoft's micropolygon pipeline, and related academic precursors such as Reyes, Progressive Meshes, and Streaming Mesh systems. Topics include cluster-based hierarchical level-of-detail, screen-space error metrics, BVH/DAG visibility culling, software rasterization fallback paths, virtual shadow maps, and GPU-driven draw dispatch. Included is a concrete implementation blueprint targeting a C++23/Vulkan 1.3 engine, with pseudocode, data layout specifications, and complexity analyses.

Keywords: virtualized geometry, LOD, BVH, cluster hierarchy, software rasterization, meshlet, screen-space error, GPU-driven rendering, visibility buffer, nanite


Table of contents

  1. Introduction and motivation
  2. Background and related work
  3. Theoretical foundations
  4. Cluster hierarchy construction
  5. Run-time visibility determination
  6. Rasterization pipeline
  7. Material evaluation and deferred shading
  8. Streaming and virtualization
  9. Virtual shadow maps
  10. Implementation architecture
  11. Performance analysis and benchmarks
  12. Advanced topics (state of the art, 2026)
  13. References

1. Introduction and motivation

1.1 The polygon-pixel mismatch

The fundamental unit of GPU rasterization is the triangle. A triangle covers a contiguous set of pixels, and the rasterizer emits one or more quads (2×2 pixel blocks) per triangle to guarantee correct partial-derivative computation for texture sampling. This design carries an implicit assumption: triangles should cover multiple pixels.

Let a mesh MM have NTN_T triangles. At camera distance dd, the projected area of a single triangle in screen space is, on average:

Aˉpx=Awscot2(θfov/2)d2WrHr\bar{A}_{px} = \frac{A_{ws} \cdot \cot^2(\theta_{fov}/2)}{d^2 \cdot W_r \cdot H_r}

where AwsA_{ws} is the average world-space triangle area, θfov\theta_{fov} is the vertical field of view, and (Wr,Hr)(W_r, H_r) is the render resolution. When Aˉpx1\bar{A}_{px} \ll 1, the rasterizer is forced to test entire quads (4 pixels) to shade a fraction of one pixel, the classical overdraw-vs-underdraw inefficiency known as quad overshading.

For a typical film-quality asset with NT=107N_T = 10^7 triangles rendered at d=10d = 10 meters:

Aˉpx0.003 px2\bar{A}_{px} \approx 0.003 \text{ px}^2

This is three orders of magnitude below the break-even threshold. The GPU wastes approximately 99.7% of fragment shader invocations on invisible or sub-pixel geometry. The Nanite whitepaper [Karis 2021] quantifies this: the naive pipeline achieves roughly 1 triangle per 20 cycles under these conditions, whereas the theoretical peak is 1 triangle per 2 cycles.

1.2 Limitations of classical LOD

The traditional solution is Discrete LOD (DLOD): artists hand-author 4–6 mesh resolutions, and the engine selects one per object per frame. This fails for the modern production pipeline in four fundamental ways:

  1. Authoring cost scales linearly with the number of unique mesh assets, which now numbers in the tens of thousands for open-world titles.
  2. Popping artifacts occur at LOD transition boundaries, requiring expensive cross-fade blending that still fails under motion.
  3. Memory redundancy: each LOD copy stores topology independently, inflating VRAM usage by a factor of 1.3\approx 1.31.8×1.8\times.
  4. Granularity: DLOD operates per-object, ignoring the fact that different regions of a single mesh may require different resolutions simultaneously (e.g., a building visible both up close and in a distant reflection).

Continuous LOD (CLOD) schemes such as Progressive Meshes [Hoppe 1996] address authoring and popping at the cost of GPU-unfriendly sequential edge-collapse dependencies. They do not parallelize.

1.3 Thesis

We argue that the correct solution decouples geometry storage from geometry rendering via a four-layer architecture:

[Offline] Cluster Hierarchy Build → Streaming Asset

[Load-time] Hierarchical BVH Construction

[Frame] GPU-Driven Cluster Selection (screen-space error)

[Frame] Software + Hardware Rasterization of selected clusters

Each layer is independently scalable, cache-friendly, and GPU-parallelizable.


2.1 Reyes and micropolygon rendering

The Reyes rendering architecture [Cook et al. 1987] introduced the concept of micropolygons: sub-pixel-sized quads generated by dicing parametric surfaces. Reyes dices surfaces until each patch produces quads of exactly 0.5 pixel in screen space, achieving theoretically optimal triangle density. Reyes is offline only, as dicing is CPU-bound and generates geometry on-the-fly. Our work brings this ideal to real-time.

2.2 Progressive Meshes

Hoppe's Progressive Meshes [1996] represent a mesh as a base coarse mesh plus a sequence of vertex splits V={v0,v1,,vk}V = \{v_0, v_1, \ldots, v_k\}. Any prefix of VV produces a valid mesh approximation. This gives a continuous, view-independent LOD stream, but the sequential nature of vertex splits prevents GPU parallelism. The dependency graph of vertex splits is a DAG with depth O(NT)O(N_T), preventing batch execution.

2.3 View-dependent LOD

Xia and Varshney [1996] and Hoppe [1997] introduced view-dependent LOD that adapts per-region. A multi-resolution mesh stores vertex splits annotated with screen-space error bounds. At runtime, the active front of splits is advanced until error bounds are met everywhere. This achieves per-region adaptation but still runs on the CPU and cannot stream.

2.4 Streaming LOD

Losasso and Hoppe [2004] and later Cignoni et al. [2004] (Batched Multi-Triangulation) introduced streaming mesh representations that load geometry on demand. The core insight, that LOD levels can be stored as independent clusters with known error bounds, directly underpins Nanite's design.

2.5 Meshlets and task/mesh shaders

NVIDIA's introduction of Task/Mesh Shaders (Turing, 2018) and the subsequent Vulkan VK_NV_mesh_shader / VK_EXT_mesh_shader extension provided the GPU programming model that makes GPU-side cluster selection tractable. A meshlet [Wihlidal 2018, Strugar 2021] is a fixed-size cluster of up to 64/128 triangles with a precomputed bounding cone for backface culling. Mesh shaders allow a compute-like amplification stage that selects which meshlets to rasterize per-thread-group.

2.6 Nanite (Unreal Engine 5)

Epic's Nanite [Karis 2021] is the production system that combined these ideas at scale. Contributions:

  • A hierarchical cluster DAG built offline, with LOD selection granularity finer than any previous system.
  • Software rasterization of small triangles via a compute shader path, avoiding hardware rasterizer setup overhead.
  • Persistent threads and a two-pass GPU culling architecture.
  • Integration with Virtual Shadow Maps for high-quality shadow rendering.

2.7 Ubisoft's micropolygon pipeline

Ubisoft's Rainbow Six Siege and Avatar: Frontiers of Pandora teams developed independent micropolygon systems [Wihlidal 2015, Méndez-Feliu 2023]. Their approach emphasizes:

  • Displacement map dicing at render time rather than offline baking.
  • Parametric surface evaluation on the GPU, generating geometry in compute shaders.
  • Integration with their proprietary Snowdrop engine's material system.

3. Theoretical foundations

3.1 Screen-space error metric

The central decision in any LOD system is: at what point is a simplified representation indistinguishable from the original? We define the geometric error of a cluster CC relative to its parent cluster PP as:

ε(C)=maxpPd(p,P^C)\varepsilon(C) = \max_{p \in P} d(p, \hat{P}_C)

where P^C\hat{P}_C is the closest-point projection of the parent surface onto the simplified cluster surface, and d(,)d(\cdot, \cdot) is Euclidean distance. This is the Hausdorff distance between the two surfaces:

ε(C)=dH(C,P)=max(supxCinfyPxy,supyPinfxCxy)\varepsilon(C) = d_H(C, P) = \max\left( \sup_{x \in C} \inf_{y \in P} \|x-y\|, \sup_{y \in P} \inf_{x \in C} \|x-y\| \right)

The projected screen-space error at camera distance dd is:

εpx(C,d)=ε(C)cot(θfov/2)dHr2\varepsilon_{px}(C, d) = \varepsilon(C) \cdot \frac{\cot(\theta_{fov}/2)}{d} \cdot \frac{H_r}{2}

This is the number of pixels by which the simplified cluster could differ from ground truth at distance dd. We define a threshold εmax\varepsilon_{max} (typically 1.0 pixel) and select the coarsest cluster CC^* such that:

εpx(C,d)εmax\varepsilon_{px}(C^*, d) \leq \varepsilon_{max}

Equivalently, we compute the minimum viewing distance at which cluster CC is acceptable:

dmin(C)=ε(C)cot(θfov/2)Hr2εmaxd_{min}(C) = \varepsilon(C) \cdot \frac{\cot(\theta_{fov}/2) \cdot H_r}{2 \cdot \varepsilon_{max}}

This distance is precomputed offline and stored per cluster as a single float. At runtime, the LOD selection reduces to a comparison: render cluster CC if d>dmin(C)d > d_{min}(C).

3.2 Cluster invariant

For a valid cluster hierarchy H\mathcal{H}, the following invariant must hold for any node CC with parent P(C)P(C):

ε(P(C))ε(C)\varepsilon(P(C)) \geq \varepsilon(C)

and:

Bounds(P(C))Bounds(C)\text{Bounds}(P(C)) \supseteq \text{Bounds}(C)

The first condition ensures monotone error growth up the hierarchy (coarser = more error). The second ensures spatial containment. These two properties guarantee that the LOD selection produces a cut through the hierarchy, a set of non-overlapping clusters that together cover the entire surface exactly once at the chosen resolution.

Lemma (Cut Correctness): Let F={CH:εpx(C,d)εmax\mathcal{F} = \{C \in \mathcal{H} : \varepsilon_{px}(C, d) \leq \varepsilon_{max} and εpx(P(C),d)>εmax}\varepsilon_{px}(P(C), d) > \varepsilon_{max}\} be the set of clusters selected by the LOD criterion. If the hierarchy invariant holds, then F\mathcal{F} is a partition of the original surface, and maxCFεpx(C,d)εmax\max_{C \in \mathcal{F}} \varepsilon_{px}(C, d) \leq \varepsilon_{max}.

Proof sketch: By induction on hierarchy depth. The root has ε(root)>εmax\varepsilon(root) > \varepsilon_{max} for all dd below a threshold. The leaves are the original clusters with ε=0\varepsilon = 0. The selection criterion picks exactly the frontier between acceptable and unacceptable, which is a valid cut by the monotonicity invariant. \square

3.3 Cluster size and cache coherence

A cluster stores NtriN_{tri} triangles. The optimal NtriN_{tri} balances:

  • Culling granularity: smaller clusters → finer culling → less overdraw
  • Culling overhead: smaller clusters → more BVH nodes → more culling work
  • Cache line utilization: GPU L1 cache lines are 128 bytes; vertex data should fill an integer number of cache lines

For a vertex storing position (12B), normal (12B), UV (8B), and tangent (8B) = 40 bytes, a cache line holds 3.2 vertices. A cluster of NvN_v vertices spans:

Lcache(Nv)=40Nv128 cache linesL_{cache}(N_v) = \left\lceil \frac{40 N_v}{128} \right\rceil \text{ cache lines}

NVIDIA documentation recommends Ntri124N_{tri} \leq 124, Nv64N_v \leq 64 [Wihlidal 2018]. The Nanite system uses Ntri=128N_{tri} = 128, Nv=128N_v = 128 per cluster, noting that larger clusters amortize per-cluster overhead better for their software rasterizer.

3.4 Spatial hashing and Morton codes

During hierarchy traversal, we need to quickly locate the cluster bounding a query point. We use Morton codes (Z-order curve) to map 3D cluster centroids to a 1D key that preserves spatial locality:

Morton(x,y,z)=interleave(x/δ,y/δ,z/δ)\text{Morton}(x, y, z) = \text{interleave}(\lfloor x/\delta \rfloor, \lfloor y/\delta \rfloor, \lfloor z/\delta \rfloor)

For a 64-bit Morton code with 21 bits per axis, the interleaving function is:

M(i)=k=020((ik)&1)3kM(i) = \sum_{k=0}^{20} ((i \gg k) \& 1) \ll 3k

This can be computed in O(1)O(1) via bit-manipulation (the classic "magic bits" technique):

constexpr uint64_t morton_expand3(uint32_t v) noexcept {
uint64_t x = v & 0x1fffff;
x = (x | x << 32) & 0x1f00000000ffff;
x = (x | x << 16) & 0x1f0000ff0000ff;
x = (x | x << 8) & 0x100f00f00f00f00f;
x = (x | x << 4) & 0x10c30c30c30c30c3;
x = (x | x << 2) & 0x1249249249249249;
return x;
}

constexpr uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) noexcept {
return morton_expand3(x) | (morton_expand3(y) << 1) | (morton_expand3(z) << 2);
}

Sorting clusters by Morton code produces a spatially coherent ordering that maximizes BVH build efficiency and GPU cache hit rate during traversal.


4. Cluster hierarchy construction

4.1 Overview

The cluster hierarchy is built offline as part of the asset pipeline. The input is a full-resolution triangle mesh MM with NTN_T triangles. The output is a tree of cluster groups G={G0,G1,,Gk}\mathcal{G} = \{G_0, G_1, \ldots, G_k\} stored in a compact binary format. The build pipeline has four stages:

Stage 1: Base Clustering M → {C_0, ..., C_n} (meshlet decomposition)
Stage 2: Group Formation Clusters → Groups (spatial grouping)
Stage 3: Group Simplification Groups → Simplified (edge collapse within group)
Stage 4: Hierarchy Assembly All levels → DAG (parent pointers + error)

4.2 Stage 1: Base clustering (meshlet decomposition)

We partition the input mesh into meshlets, clusters of at most Ntrimax=128N_{tri}^{max} = 128 triangles sharing a connected patch of surface. The objective is to minimize the surface area of each cluster's bounding volume while respecting the triangle count limit.

Algorithm (Greedy Meshlet Builder):

function BuildMeshlets(M, N_tri_max, N_vert_max):
mark all triangles as unassigned
while unassigned triangles exist:
seed = pick_unassigned_triangle()
C = new Cluster(seed)
frontier = adjacency_ring(seed)
while frontier not empty and |C| < N_tri_max:
t = frontier.pop_best(C) // minimize AABB growth
if |C.vertices ∪ verts(t)| <= N_vert_max:
C.add(t)
frontier.push(adjacency_ring(t) \ C)
emit(C)
return all clusters

The pop_best function selects the frontier triangle that minimizes the increase in the cluster's axis-aligned bounding box. This greedy heuristic achieves tight bounding volumes in O(NTlogNT)O(N_T \log N_T) time.

Formal AABB cost:

cost(t,C)=Vol(AABB(C{t}))Vol(AABB(C))\text{cost}(t, C) = \text{Vol}(\text{AABB}(C \cup \{t\})) - \text{Vol}(\text{AABB}(C))

An alternative cost that empirically produces better culling efficiency is the surface area heuristic (SAH):

costSAH(t,C)=SA(AABB(C{t}))SA(AABB(C))1\text{cost}_{SAH}(t, C) = \frac{\text{SA}(\text{AABB}(C \cup \{t\}))}{\text{SA}(\text{AABB}(C))} - 1

4.3 Stage 2: Group formation

To enable hierarchical simplification, we group adjacent clusters into cluster groups of size GsizeG_{size} (typically 4–8 clusters). A group GiG_i represents the atomic unit of LOD transition: all clusters in a group are replaced simultaneously by their parent.

Grouping uses a graph partitioning approach. Construct a graph A\mathcal{A} where:

  • Nodes are clusters {C0,,Cn}\{C_0, \ldots, C_n\}
  • Edge (Ci,Cj)(C_i, C_j) has weight equal to the number of shared boundary edges

We seek a partition of A\mathcal{A} into groups of size Gsize\approx G_{size} that minimizes the total weight of cut edges. This is a variant of the minimum kk-cut problem, which is NP-hard in general but well-approximated by METIS [Karypis and Kumar 1998]:

minP(i,j)cut(P)wij\min_{\mathcal{P}} \sum_{(i,j) \in \text{cut}(\mathcal{P})} w_{ij}

subject to PkGsize(1+δ)|\mathcal{P}_k| \leq G_{size} \cdot (1 + \delta) for all parts kk, where δ=0.05\delta = 0.05 is a balance tolerance.

Minimizing cut edges directly minimizes the number of boundary edges shared between groups. These boundary edges must not move during simplification to preserve watertightness across group boundaries, they form the locked boundary.

4.4 Stage 3: Group simplification

Within each group GiG_i, we simplify the interior triangles using Quadric Error Metrics (QEM) [Garland and Heckbert 1997], keeping boundary edges fixed.

Quadric Error Definition:

For a vertex vv, let P(v)\mathcal{P}(v) be the set of planes of triangles incident to vv. The quadric QvQ_v is the 4×44 \times 4 symmetric matrix:

Qv=pP(v)ppTQ_v = \sum_{p \in \mathcal{P}(v)} \mathbf{p} \mathbf{p}^T

where p=(a,b,c,d)T\mathbf{p} = (a, b, c, d)^T is the plane equation ax+by+cz+d=0ax + by + cz + d = 0 normalized so that a2+b2+c2=1a^2 + b^2 + c^2 = 1. The error of placing a vertex at position vˉ=(x,y,z,1)T\bar{v} = (x, y, z, 1)^T is:

εQEM(vˉ)=vˉTQvvˉ\varepsilon_{QEM}(\bar{v}) = \bar{v}^T Q_v \bar{v}

For an edge collapse e=(v1,v2)ve = (v_1, v_2) \to v^*, we seek:

v=argminv  vT(Qv1+Qv2)  vv^* = \arg\min_v \; v^T (Q_{v_1} + Q_{v_2}) \; v

Setting the gradient to zero:

(q11q12q13q14q12q22q23q24q13q23q33q340001)v=(0001)\begin{pmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{12} & q_{22} & q_{23} & q_{24} \\ q_{13} & q_{23} & q_{33} & q_{34} \\ 0 & 0 & 0 & 1 \end{pmatrix} v^* = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}

If this system is singular (e.g., on a flat surface), we fall back to the midpoint v=(v1+v2)/2v^* = (v_1 + v_2)/2.

Group simplification target: We reduce each group by 50% in triangle count, so each LOD level halves the total triangle count. Given LL LOD levels and NT0N_T^0 base triangles:

NTL=NT02LN_T^L = N_T^0 \cdot 2^{-L}

The number of levels required to reach a single triangle root is:

Lmax=log2(NT0)L_{max} = \lceil \log_2(N_T^0) \rceil

For a 10710^7 triangle mesh: Lmax=24L_{max} = 24. In practice, we stop when a group fits in a single cluster (typically L12L \approx 121616).

Hausdorff Error Tracking:

After each simplification pass on group GiG_i, we compute and store the Hausdorff distance ε(GiL+1,GiL)\varepsilon(G_i^{L+1}, G_i^L), the maximum geometric deviation from the previous LOD level. The accumulated error from the original surface to level LL is:

εaccL=maxl=0L1(ε(Gil+1,Gil)+εaccl)\varepsilon_{acc}^L = \max_{l=0}^{L-1} \left( \varepsilon(G_i^{l+1}, G_i^l) + \varepsilon_{acc}^l \right)

This is the value stored per cluster and used at runtime for the LOD selection test.

4.5 Stage 4: Hierarchy assembly and the DAG structure

After all levels are simplified, we assemble the cluster DAG (Directed Acyclic Graph). Unlike a strict tree, the DAG allows a child cluster to be shared by multiple parents, this occurs at group boundaries where clusters from two groups merge.

The DAG has the following formal properties:

  • Roots: one or more coarse clusters at the top
  • Leaves: the original meshlet clusters at the bottom
  • Interior nodes: simplified cluster groups
  • Each interior node NN has c[2,Gsize]c \in [2, G_{size}] children
  • Each interior node stores: bounding sphere (c,r)(\mathbf{c}, r), εacc\varepsilon_{acc}, dmind_{min}, child indices, streaming page ID

Bounding sphere fitting:

We use Welzl's algorithm [1991] to compute the minimal enclosing sphere in expected O(n)O(n) time. For a cluster's triangle set {p0,,pk}\{p_0, \ldots, p_k\}:

MinSphere=Welzl({p0,,pk},)\text{MinSphere} = \text{Welzl}(\{p_0, \ldots, p_k\}, \emptyset)

The resulting sphere (c,r)(\mathbf{c}, r) is tight and minimal, minimizing false-positive culling by the BVH.

DAG Storage Format:

struct ClusterNode { // 64 bytes, fits 2 per cache line
float3 bounds_center; // 12B
float bounds_radius; // 4B (bounding sphere)
float lod_error; // 4B (ε_acc, world-space)
float parent_lod_error; // 4B (parent's ε_acc, for cut decision)
uint32_t flags; // 4B (leaf, streaming, etc.)
uint32_t page_id; // 4B (streaming page)
uint32_t cluster_data_offset; // 4B (byte offset in page)
uint32_t children[8]; // 32B (child node indices, 0=none)
}; // Total: 68B → pad to 72B or store children separately

A compact children-list design stores variable-arity children in a parallel array:

struct ClusterNode { // 32 bytes
float3 bounds_center; // 12B
float bounds_radius; // 4B
float lod_error; // 4B
uint32_t page_id; // 4B
uint32_t first_child; // 4B (index into children array)
uint8_t child_count; // 1B
uint8_t flags; // 1B
uint16_t padding; // 2B
}; // 32B = one cache line split

Total hierarchy size for a 10710^7 triangle mesh with 80,000\sim 80,000 leaf clusters: approximately 80,000×(1+1/2+1/4+)160,00080,000 \times (1 + 1/2 + 1/4 + \ldots) \approx 160,000 nodes ×\times 32B =4.9= 4.9 MB, well within budget.


5. Run-time visibility determination

5.1 Two-pass GPU culling architecture

The core runtime algorithm is a two-pass persistent-thread culling pipeline, inspired by Wihlidal's "GPU-Driven Rendering Pipelines" [2015] and refined in Nanite.

Pass 1 (Main Pass, Fast Reject):

  • Traverse the cluster BVH from the previous frame's visibility state
  • For each node: test against frustum, occlusion (HZB), and LOD criterion
  • Emit draw commands for surviving clusters into an indirect draw buffer

Pass 2 (Fallback Pass, Newly Visible):

  • Re-test clusters that failed occlusion in Pass 1 with the current frame's HZB (just rendered by Pass 1)
  • Emit additional draw commands for newly unoccluded clusters

This two-pass structure achieves 1-frame-latency occlusion culling at the cost of rendering the same cluster at most twice across both passes.

5.2 Frustum culling

For each cluster node NN with bounding sphere (c,r)(\mathbf{c}, r), frustum culling is a half-space test against the 6 frustum planes {ni,di}i=05\{\mathbf{n}_i, d_i\}_{i=0}^5:

inside(N)=i[0,5]:nic+di>r\text{inside}(N) = \forall i \in [0,5]: \mathbf{n}_i \cdot \mathbf{c} + d_i > -r

This requires 6 dot products and 6 comparisons per node, executing at approximately 200M nodes/second on a modern GPU.

Optimized SIMD frustum test (C++ host side for BVH pre-cull):

bool FrustumCullSphere(const __m128 planes[6], __m128 center, float radius) {
// center = (cx, cy, cz, 1), planes[i] = (nx, ny, nz, d)
for (int i = 0; i < 6; ++i) {
float dist = _mm_cvtss_f32(_mm_dp_ps(planes[i], center, 0xF1));
if (dist < -radius) return false; // outside
}
return true;
}

5.3 Hierarchical Z-buffer (HZB) occlusion culling

The HZB is a full mipmap chain of the depth buffer, where each mip level stores the maximum depth (farthest depth value in DirectX/Metal convention, or minimum in OpenGL) of the corresponding 2×2 block.

For a cluster with bounding sphere (c,r)(\mathbf{c}, r), the occlusion test proceeds:

  1. Project the sphere to screen space, yielding a bounding rectangle [xmin,xmax]×[ymin,ymax][x_{min}, x_{max}] \times [y_{min}, y_{max}] and a near depth znearz_{near}.

  2. Select HZB mip level LL such that the bounding rectangle covers 1\approx 1 texel: L=log2(max(xmaxxmin,ymaxymin))L = \log_2\left(\max(x_{max} - x_{min}, y_{max} - y_{min})\right)

  3. Sample the HZB at mip LL covering the bounding rect. The sampled depth is zHZBz_{HZB}.

  4. If znear>zHZBz_{near} > z_{HZB} (cluster is farther than the occluder), cull it.

The depth projection of the sphere's near point (in view space, sphere at cvs\mathbf{c}_{vs} with radius rr):

znear,NDC=ProjectDepth(cvs.zr)z_{near,NDC} = \text{ProjectDepth}(\mathbf{c}_{vs}.z - r)

using the standard projection matrix entry P33P_{33} and P34P_{34}:

zNDC=P33zvs+P34zvsz_{NDC} = \frac{P_{33} \cdot z_{vs} + P_{34}}{z_{vs}}

5.4 LOD selection on the GPU

After frustum and occlusion culling, surviving nodes are tested against the LOD criterion. The GPU computes the camera distance to the cluster's bounding sphere center:

d=cwscamwsrd = \|\mathbf{c}_{ws} - \mathbf{cam}_{ws}\| - r

The cluster is selected (rendered at this level) if:

dmin(C)ddmin(P(C))d_{min}(C) \leq d \leq d_{min}(P(C))

where dmin(C)d_{min}(C) is precomputed offline. This is equivalent to:

εacc(C)Kd<εacc(P(C))K\varepsilon_{acc}(C) \cdot K \leq d < \varepsilon_{acc}(P(C)) \cdot K

with the constant K=cot(θfov/2)Hr2εmaxK = \frac{\cot(\theta_{fov}/2) \cdot H_r}{2 \cdot \varepsilon_{max}} computed once per frame and pushed as a uniform.

This double-sided test ensures exactly one level in the hierarchy is selected per surface region, producing the unique cut property proven in §3.2.

5.5 Persistent thread BVH traversal

GPU BVH traversal using a persistent thread group model avoids the overhead of launching separate kernels per hierarchy level. The algorithm maintains a global work queue in device memory:

// Compute shader: persistent_cull.comp
layout(local_size_x = 64) in;

layout(binding = 0) buffer WorkQueue { uint nodes[]; } work_in;
layout(binding = 1) buffer WorkQueueOut { uint nodes[]; } work_out;
layout(binding = 2) buffer DrawCmds { DrawIndexedIndirectCommand cmds[]; };
layout(binding = 3) buffer Counters { uint work_count, draw_count; };

void main() {
while (true) {
uint idx = atomicAdd(work_consumed, 1);
if (idx >= work_count) break;

uint node_id = work_in.nodes[idx];
ClusterNode node = fetch_node(node_id);

if (!FrustumCull(node)) continue;
if (!HZBCull(node)) continue;

bool is_leaf = (node.flags & FLAG_LEAF) != 0;
bool lod_select = EvalLODCriterion(node, K);

if (is_leaf || lod_select) {
// Emit draw command
if (page_resident(node.page_id)) {
uint draw_idx = atomicAdd(draw_count, 1);
emit_draw(draw_idx, node);
} else {
request_stream(node.page_id);
}
} else {
// Recurse into children
for (uint c = 0; c < node.child_count; ++c) {
uint out_idx = atomicAdd(work_produced, 1);
work_out.nodes[out_idx] = node.children[c];
}
}
}
}

This runs as a single dispatch with enough thread groups to keep all SMs busy. The total work is O(NvisiblelogNtotal)O(N_{visible} \cdot \log N_{total}) in the worst case, but empirically O(Nvisible)O(N_{visible}) since the BVH prunes most of the tree early.

5.6 Cluster group cut invariant, preventing cracks

The LOD selection must always select all clusters in a group together, or none of them. Selecting partial groups creates gaps at group boundaries because the simplified parent covers different triangles than the original.

We enforce this via the group bounding sphere: all clusters in a group GiG_i share the same group bounding sphere (cGi,rGi)(\mathbf{c}_{G_i}, r_{G_i}), and the LOD test uses this group sphere, not the individual cluster sphere. This guarantees that the transition threshold is evaluated uniformly for the entire group.


6. Rasterization pipeline

6.1 The sub-pixel triangle problem

When a triangle's screen-space area is less than Amin4A_{min} \approx 4 pixels (one hardware quad), the GPU rasterizer becomes inefficient:

  • The triangle still consumes a full hardware rasterizer setup unit (typically 2–4 cycles)
  • The emitted quad covers 4 pixels, of which 3 may be outside the triangle (75% waste)
  • Triangle setup itself (edge equations, reciprocal ww, attribute gradients) costs 25\sim 25 cycles on current hardware

For a cluster where all 128 triangles are sub-pixel, the hardware rasterizer costs 128×25=3200128 \times 25 = 3200 cycles for setup alone, versus 128×2=256128 \times 2 = 256 cycles of raster work. The setup-to-raster ratio is 12.5×, catastrophically inefficient.

6.2 Software rasterizer design

Nanite's solution is a software rasterizer for small triangles implemented as a compute shader. For triangles covering 2\leq 2 pixels, software is faster for three reasons: hardware setup overhead disappears, multiple triangles fit in a single warp, and writing directly to a visibility buffer via atomics avoids depth test hardware entirely.

Algorithm:

For each triangle Tj=(v0,v1,v2)T_j = (v_0, v_1, v_2):

  1. Project vertices to screen space: si=P(vi)=(xi,yi,ziNDC)s_i = P(v_i) = (x_i, y_i, z_i^{NDC})
  2. Compute bounding box: [xmin,xmax]×[ymin,ymax][x_{min}, x_{max}] \times [y_{min}, y_{max}], clamp to screen
  3. For each pixel (px,py)(px, py) in bounding box:
    • Compute barycentric coordinates (λ0,λ1,λ2)(\lambda_0, \lambda_1, \lambda_2) via the edge function: e01(px,py)=(pyy0)(x1x0)(pxx0)(y1y0)e_{01}(px, py) = (py - y_0)(x_1 - x_0) - (px - x_0)(y_1 - y_0)
    • If e010e_{01} \geq 0, e120e_{12} \geq 0, e200e_{20} \geq 0: pixel is inside triangle
    • Compute depth: z=λ0z0+λ1z1+λ2z2z = \lambda_0 z_0 + \lambda_1 z_1 + \lambda_2 z_2
    • Atomically compare-and-swap into the visibility buffer:
      uint payload = pack(cluster_id, triangle_id);
      uint depth_bits = floatBitsToUint(z);
      uint packed = (depth_bits & 0xFFFFFF00) | (payload & 0xFF);
      // 64-bit atomic: depth in high bits, payload in low bits
      atomicMax(visibility_buffer[px + py * W], pack64(depth_bits, payload));

Edge function computation (fixed-point for sub-pixel accuracy):

To avoid floating-point gaps, we compute edge functions in 28.4 fixed-point:

int32_t FP(float v) { return (int32_t)(v * 16.0f); } // 4 subpixel bits

struct EdgeEq {
int32_t A, B, C;
EdgeEq(float2 v0, float2 v1) :
A(FP(v0.y) - FP(v1.y)),
B(FP(v1.x) - FP(v0.x)),
C(FP(v0.x) * FP(v1.y) - FP(v0.y) * FP(v1.x)) {}
int32_t eval(int32_t x, int32_t y) const { return A*x + B*y + C; }
};

6.3 Hybrid rasterization, decision heuristic

We classify each cluster as "small" or "large" and dispatch to the appropriate rasterizer:

mode(C)={softwareif Aˉpx(C)<τswhardwareotherwise\text{mode}(C) = \begin{cases} \text{software} & \text{if } \bar{A}_{px}(C) < \tau_{sw} \\ \text{hardware} & \text{otherwise} \end{cases}

where τsw=4\tau_{sw} = 4 pixels² (empirically optimal on most architectures, Karis 2021 uses 2 hardware quads as the threshold).

The classification runs on the GPU: after the culling pass, a compute shader bins clusters into two indirect dispatch/draw buffers, one for software raster, one for hardware. This binning adds 0.1\sim 0.1 ms overhead but saves 1\sim 122 ms on scenes heavy with small triangles.

6.4 Visibility buffer (V-Buffer)

Instead of writing directly to GBuffer attributes (which would require re-fetching vertex data per pixel), we write a Visibility Buffer [Burns and Hunt 2013]:

VisibilityBuffer pixel = { cluster_id (22 bits) | triangle_id (7 bits) | ... }

This 32-bit (or 64-bit with depth) value uniquely identifies the visible triangle at each pixel. A second full-screen pass then:

  1. Reads the cluster/triangle ID from the visibility buffer
  2. Fetches the triangle's vertices from the cluster's vertex buffer
  3. Computes pixel-accurate barycentric coordinates
  4. Evaluates the material graph (PBR BRDF) at this pixel

This decouples geometry rasterization from shading: the rasterizer never touches material data, and each pixel receives exactly one material evaluation regardless of overdraw.

Barycentric coordinate reconstruction from Visibility Buffer:

Given the triangle (v0,v1,v2)(v_0, v_1, v_2) and pixel position (px,py)(px, py):

B=(x1x0x2x0y1y0y2y0)1(pxx0pyy0)\mathbf{B} = \begin{pmatrix} x_1 - x_0 & x_2 - x_0 \\ y_1 - y_0 & y_2 - y_0 \end{pmatrix}^{-1} \begin{pmatrix} px - x_0 \\ py - y_0 \end{pmatrix}

Then (λ1,λ2)=B(\lambda_1, \lambda_2) = \mathbf{B}, λ0=1λ1λ2\lambda_0 = 1 - \lambda_1 - \lambda_2.

Perspective-correct interpolation requires dividing by ww:

finterp=λ0f0/w0+λ1f1/w1+λ2f2/w2λ0/w0+λ1/w1+λ2/w2f_{interp} = \frac{\lambda_0 f_0/w_0 + \lambda_1 f_1/w_1 + \lambda_2 f_2/w_2}{\lambda_0/w_0 + \lambda_1/w_1 + \lambda_2/w_2}


7. Material evaluation and deferred shading

7.1 Material classification problem

A virtualized geometry scene may render thousands of unique materials in a single frame, each requiring different shader code. Naively, a different draw call per material leads to 10,000+ draw calls per frame, unacceptable overhead.

The standard solution is material binning: sort pixels by material ID and execute one compute dispatch per unique material.

Material sort algorithm:

  1. After the visibility buffer pass, read each pixel's cluster ID
  2. Look up the cluster's material ID from a GPU-side table
  3. Write (material_id,pixel_xy)(material\_id, pixel\_xy) pairs into a global buffer
  4. Sort pairs by material ID (GPU radix sort, O(n)O(n) passes)
  5. Dispatch one CS thread-group per material, strided across the sorted pixels

Radix sort over Npx=4K×2.25K9MN_{px} = 4K \times 2.25K \approx 9M pixels with 16-bit material IDs requires: 16/8=2 passes, each O(Npx)=O(9M)\lceil 16/8 \rceil = 2 \text{ passes, each } O(N_{px}) = O(9M)

At 100 operations/cycle and 10 TFlops/s: 0.09\approx 0.09 ms for the sort, acceptable.

7.2 Physically-based material evaluation

Each material computes a BRDF contribution. We use the GGX microfacet BRDF [Walter et al. 2007]:

fr(ωi,ωo)=D(h)G(ωi,ωo)F(ωo,h)4(nωi)(nωo)f_r(\omega_i, \omega_o) = \frac{D(h) G(\omega_i, \omega_o) F(\omega_o, h)}{4 (\mathbf{n} \cdot \omega_i)(\mathbf{n} \cdot \omega_o)}

where:

  • D(h)=α2π((nh)2(α21)+1)2D(h) = \frac{\alpha^2}{\pi ((\mathbf{n} \cdot h)^2 (\alpha^2 - 1) + 1)^2} is the GGX NDF, roughness α\alpha
  • G(ωi,ωo)=G1(ωi)G1(ωo)G(\omega_i, \omega_o) = G_1(\omega_i) G_1(\omega_o) with G1(ω)=2(nω)(nω)(2α)+αG_1(\omega) = \frac{2(\mathbf{n} \cdot \omega)}{(\mathbf{n} \cdot \omega)(2 - \alpha) + \alpha} (Smith-GGX)
  • F(ωo,h)=F0+(1F0)(1ωoh)5F(\omega_o, h) = F_0 + (1 - F_0)(1 - \omega_o \cdot h)^5 (Schlick Fresnel)

The material compute shader evaluates this per pixel using data fetched from texture arrays (BC7-compressed, 128KB–4MB per material LOD level).

7.3 Programmable shading rate

For regions of constant irradiance (flat surfaces, distant objects), we apply Variable Rate Shading (VRS) to evaluate materials at reduced rate (1×2, 2×2 blocks). The VRS mask is generated from:

  • Velocity buffer (high motion → full rate)
  • Depth gradient (geometric discontinuities → full rate)
  • Luminance variance from previous frame

This reduces material evaluation cost by 1.51.52×2\times on scenes with large flat surfaces.


8. Streaming and virtualization

8.1 Streaming page architecture

Cluster geometry is stored in streaming pages, fixed-size blocks of GPU memory (typically 128KB each) that can be independently loaded and evicted. This design mirrors GPU texture streaming.

Page layout:

Page (128KB):
Header (256B):
- page_id : uint32
- cluster_count : uint16
- vertex_count : uint32
- index_count : uint32
- lod_level : uint8
- flags : uint8
- checksum : uint32

Cluster Descriptors (variable, cluster_count × 32B):
- per-cluster: vertex_offset, index_offset, material_id, error

Vertex Data (variable):
- position : float16x3 (6B per vertex)
- normal : Oct16 (2B, octahedral encoding)
- tangent : Oct16 (2B)
- uv : float16x2 (4B)
Total: 14B per vertex → 9,300 vertices/page maximum

Index Data (variable):
- uint8 indices (0-127) within page-local vertex pool
Total: 3B per triangle → 43,000 triangles/page maximum

Vertex quantization significantly reduces bandwidth. Positions are stored relative to the page origin in 16-bit fixed-point:

p^i=round(piporiginδ)[0,65535]3\hat{p}_i = \text{round}\left(\frac{p_i - p_{origin}}{\delta}\right) \in [0, 65535]^3

where δ\delta is chosen so that quantization error is below 0.10.1 mm for typical scene scales.

8.2 Streaming priority

The streaming system maintains a priority queue of pages sorted by streaming urgency. Priority is computed per page:

priority(P)=screen_coverage(P)load_time_estimate(P)LOD_penalty(P)\text{priority}(P) = \frac{\text{screen\_coverage}(P)}{\text{load\_time\_estimate}(P)} \cdot \text{LOD\_penalty}(P)

where:

  • screen_coverage(P)=CPπrC2/dC2K\text{screen\_coverage}(P) = \sum_{C \in P} \pi r_C^2 / d_C^2 \cdot K (projected area in pixels)
  • load_time_estimate(P)=bytes(P)/disk_bandwidth\text{load\_time\_estimate}(P) = \text{bytes}(P) / \text{disk\_bandwidth}
  • LOD_penalty(P)=1+maxCP(εpx(C,dC)1)+\text{LOD\_penalty}(P) = 1 + \max_{C \in P} (\varepsilon_{px}(C, d_C) - 1)^+ (extra cost if currently over-error-budget)

Pages not needed for more than Tevict=30T_{evict} = 30 frames are evicted (LRU policy). The streaming manager maintains a residency set of pages resident in GPU memory, with hysteresis to avoid thrashing.

8.3 Bandwidth analysis

For a scene with NvisibleN_{visible} clusters rendered at 60 fps, the streaming bandwidth required is:

Bstream=NnewSpagefpsB_{stream} = N_{new} \cdot S_{page} \cdot fps

where NnewN_{new} is the average number of newly required pages per frame. For camera motion at speed vv and page coverage radius rpager_{page}:

NnewvfpsNpages1/3rpageN_{new} \approx \frac{v \cdot fps \cdot N_{pages}^{1/3}}{r_{page}}

For an open-world scene with 10510^5 pages, v=30v = 30 m/s, rpage=100r_{page} = 100 m: Nnew15N_{new} \approx 15 pages/frame, giving Bstream15×128KB×60115B_{stream} \approx 15 \times 128 \text{KB} \times 60 \approx 115 MB/s, within the 500\sim 500 MB/s budget of modern PCIe 4.0 setups even without DirectStorage.

DirectStorage / GPU decompression: Using GPU-decompressed LZ4 (or GDeflate), compression ratios of 2–4× reduce the effective read bandwidth requirement:

Bdisk=Bstreamcompression_ratio115 MB/s338 MB/sB_{disk} = \frac{B_{stream}}{\text{compression\_ratio}} \approx \frac{115 \text{ MB/s}}{3} \approx 38 \text{ MB/s}

Well within HDD throughput, letting SSDs handle any burst demand.


9. Virtual shadow maps

9.1 Motivation

Traditional shadow mapping requires rendering the entire scene from the light's perspective at sufficient resolution to match the shadow receiver's pixel density. For a directional sun light at 4K shadow map resolution, this means 16M16M shadow texels covering the entire scene, massive overdraw.

Virtual Shadow Maps (VSM) [Wihlidal 2023, Rendering Nanite 2021] apply the same virtualization principle to shadow maps: only allocate shadow map pages for shadow receivers that are actually visible.

9.2 Virtual shadow map architecture

The VSM is a large virtual texture (e.g., 16K × 16K = 2282^{28} pixels), subdivided into 128×128 texel pages. Only pages actually sampled during GBuffer shadow lookup are physically allocated.

Two-phase algorithm:

Phase 1 (Mark): During G-buffer generation, for each opaque pixel, compute which VSM page it samples. Mark those pages as needed.

Phase 2 (Render): For each needed VSM page, determine which clusters could cast shadows into it. Cull and rasterize those clusters from the light's view.

The invalidation condition for a VSM page PvsmP_{vsm}: if any caster geometry in the page's frustum has moved since last frame, mark the page dirty and re-render it.

9.3 Clipmap organization

For a directional light (sun), we organize the VSM as a clipmap with Nclip=8N_{clip} = 8 levels. Each level covers a quadrant of the scene at progressively coarser resolution:

texels_per_meter at level l=Rbase2l\text{texels\_per\_meter at level } l = \frac{R_{base}}{2^l}

A pixel at distance dd from the camera samples clipmap level:

l=max(0,log2(ddmin))l^* = \max\left(0, \left\lceil \log_2\left(\frac{d}{d_{min}}\right) \right\rceil\right)

This ensures shadow resolution matches camera-space shadow receiver density, analogous to mipmapping for textures.

Page table lookup: The page table is stored as a 128×128×Nclip128 \times 128 \times N_{clip} 3D texture, where each texel stores a physical page pointer (or invalid/uncached flag). Lookup from the GBuffer shader:

vec2 shadow_uv = (world_pos.xy - clipmap_origin.xy) / clipmap_size[level];
ivec2 page_coord = ivec2(shadow_uv * 128.0);
uint physical_page = page_table[level][page_coord.x][page_coord.y];
vec2 local_uv = fract(shadow_uv * 128.0) / 128.0 + page_offset(physical_page);
float shadow_depth = texture(physical_shadow_atlas, local_uv).r;
return world_depth > shadow_depth + BIAS;

10. Implementation architecture

10.1 Engine integration overview

A complete VGS implementation integrates into an engine as the following subsystem graph:

Asset Pipeline (Offline):
┌─────────────────────────────────────────────────────────┐
│ MeshImporter → Clusterer → Grouper → QEM → Packager │
└────────────────────────────┬────────────────────────────┘
│ .gpa (GP Asset)

Runtime (Per-Frame):
┌────────────────────────────────────────────────────────────────┐
│ StreamingManager │
│ └─ PageCache (GPU VRAM) ←─────── DiskIO (DirectStorage) │
│ │
│ FrameRenderer │
│ ├─ CullPass (Compute) │
│ │ ├─ BVH Traversal (persistent threads) │
│ │ ├─ Frustum/HZB/LOD culling │
│ │ └─ Emit: SwRaster bins, HW draw calls, stream requests │
│ │ │
│ ├─ SoftwareRasterPass (Compute) │
│ │ └─ Write VisibilityBuffer │
│ │ │
│ ├─ HardwareRasterPass (Mesh Shader / Indirect Draw) │
│ │ └─ Write VisibilityBuffer │
│ │ │
│ ├─ MaterialPass (Compute) │
│ │ ├─ Material binning + sort │
│ │ └─ Per-pixel BRDF evaluation → GBuffer │
│ │ │
│ ├─ LightingPass (standard deferred) │
│ │ │
│ └─ VSM Pass │
│ ├─ Mark needed pages │
│ └─ Render shadow clusters per page │
└────────────────────────────────────────────────────────────────┘

10.2 Data structures

// ============================================================
// Cluster hierarchy node (CPU-side, 48B)
// ============================================================
struct alignas(16) VGSNode {
Vec3f bounds_center; // bounding sphere center
float bounds_radius; // bounding sphere radius
float lod_error; // ε_acc (world-space Hausdorff)
float parent_lod_error; // parent's ε_acc
uint32_t page_id; // streaming page index
uint32_t cluster_offset; // byte offset within page
uint32_t first_child_idx; // index into node pool
uint8_t child_count;
uint8_t flags; // FLAG_LEAF | FLAG_LOADED | FLAG_STREAMING
uint16_t group_id; // cluster group for cut invariant
};

// ============================================================
// GPU-side cluster descriptor (in page header, 32B)
// ============================================================
struct GPUClusterDesc {
uint32_t vertex_offset; // offset into page vertex buffer
uint32_t index_offset; // offset into page index buffer
uint16_t vertex_count;
uint16_t triangle_count;
uint32_t material_id;
float lod_error;
uint32_t flags;
uint32_t _pad;
};

// ============================================================
// Indirect draw command (for hardware raster path)
// ============================================================
struct DrawMeshTasksIndirectCommand {
uint32_t task_count; // number of task shader workgroups
uint32_t first_task;
uint32_t cluster_id; // pushed via draw-ID lookup
};

// ============================================================
// Visibility buffer entry (64-bit atomic)
// ============================================================
// bits [63:40] = depth (24-bit unorm)
// bits [39:17] = cluster_id (23-bit, 8M clusters max)
// bits [16: 9] = triangle_id (8-bit, 256 triangles max)
// bits [ 8: 0] = reserved
using VisibilityEntry = uint64_t;

10.3 Asset pipeline implementation

The offline build pipeline is single-threaded per mesh but meshes are processed in parallel. Build time for a 10710^7 triangle mesh:

StageAlgorithmTime ComplexityTypical Time
Base ClusteringGreedy meshlet buildO(NTlogNT)O(N_T \log N_T)8 s
Graph PartitioningMETIS k-way partitionO(NT)O(N_T)3 s
QEM SimplificationHeap-based edge collapseO(NTlogNT)O(N_T \log N_T)45 s
Error ComputationBVH Hausdorff approx.O(NTlogNT)O(N_T \log N_T)12 s
Page PackingBin-packing heuristicO(NT)O(N_T)2 s
Total~70 s

For a game with 20,000 unique meshes at average 10510^5 triangles: build time 20000×0.7s=14,000s4h\approx 20000 \times 0.7s = 14,000s \approx 4h on a single core, or 15\approx 15 min on a 16-core build server.

10.4 Render loop pseudocode

void VGSRenderer::RenderFrame(const RenderView& view) {
// --- Update streaming ---
streaming_mgr_.UpdateRequests();
streaming_mgr_.ProcessCompletions();

// --- Compute K constant ---
float K = (1.0f / tanf(view.fov_y * 0.5f)) * view.height * 0.5f / error_threshold_;

// --- Upload per-frame constants ---
PerFrameData pfd;
pfd.view_proj = view.view_proj;
pfd.cam_pos = view.position;
pfd.lod_K = K;
pfd.hzb_dims = hzb_.dimensions();
frame_constants_.Upload(pfd);

// --- Pass 1: Cull with last frame's HZB ---
{
auto scope = gpu_profiler_.Scope("Cull Pass 1");
cull_shader_.Dispatch({
.node_pool = node_pool_.buffer,
.hzb = hzb_.previous_frame(),
.draw_sw_out = sw_draw_buffer_,
.draw_hw_out = hw_draw_buffer_,
.stream_reqs = stream_request_buffer_,
.counters = counter_buffer_
});
}

// --- Software rasterization pass ---
{
auto scope = gpu_profiler_.Scope("Software Raster");
sw_raster_shader_.Dispatch({
.clusters = cluster_data_pool_,
.draw_commands = sw_draw_buffer_,
.visibility_buf = visibility_buffer_,
.depth_buf = depth_buffer_
});
}

// --- Hardware rasterization pass ---
{
auto scope = gpu_profiler_.Scope("Hardware Raster");
cmd_.DrawMeshTasksIndirect(hw_draw_buffer_, counter_buffer_.hw_offset);
}

// --- Generate HZB from depth buffer ---
hzb_.Generate(depth_buffer_);

// --- Pass 2: Cull newly visible with current HZB ---
{
auto scope = gpu_profiler_.Scope("Cull Pass 2");
cull_shader_.Dispatch({ .hzb = hzb_.current_frame(), ... });
sw_raster_shader_.Dispatch({ ... });
cmd_.DrawMeshTasksIndirect(hw_draw_buffer2_, counter_buffer_.hw2_offset);
}

// --- Material binning + evaluation ---
{
auto scope = gpu_profiler_.Scope("Material Pass");
material_binner_.Dispatch(visibility_buffer_, material_sort_buffer_);
radix_sort_.Sort(material_sort_buffer_);
material_eval_.Dispatch(material_sort_buffer_, gbuffer_);
}

// --- Deferred lighting ---
lighting_pass_.Execute(gbuffer_, view);

// --- Virtual shadow maps ---
vsm_.MarkNeededPages(gbuffer_, view);
vsm_.RenderNeededPages(node_pool_, streaming_mgr_);
vsm_.CompositeIntoLighting();
}

10.5 Memory budget

For a next-gen open-world title targeting 8GB VRAM:

System ComponentBudgetNotes
Cluster node pool64 MB~2M nodes × 32B
Streaming page cache2048 MB~16K pages × 128KB
Visibility buffer (4K)64 MB4096×2304×8B
HZB (all mips)32 MB~1.33× depth buffer size
GBuffer (4K, 3 targets)192 MB3 × 4096×2304×8B
VSM atlas (16K × 16K)512 MB16384²×4B, but virtualized ~50%
Texture arrays2048 MBBC7 material textures
Misc (UIs, particles, FX)512 MBOther systems
Total5.4 GBLeaves 2.6 GB headroom

11. Performance analysis and benchmarks

11.1 Theoretical throughput

Hardware rasterizer throughput (NVIDIA Ada Lovelace, RTX 4080):

  • Peak: 121.6 G triangles/s (theoretical)
  • Realistic (bounded by setup): ~30 G triangles/s for average triangles
  • Sub-pixel triangle effective throughput: ~2 G triangles/s

Software rasterizer throughput (compute shader, same GPU):

  • Peak compute: 48.7 TFLOPS FP32
  • Sub-pixel triangle cost: ~32 FP32 ops each
  • Effective throughput: 48.7T/321.548.7T / 32 \approx 1.5 G triangles/s
  • But: eliminates setup overhead for small triangles, net gain ~4×

The crossover point is at 4\approx 4 pixels² triangle area; below this, software wins.

11.2 Scalability analysis

Let NTN_T be total triangles in scene, NVN_V be visible triangles after culling.

OperationComplexityBottleneck
BVH TraversalO(NVlogNT)O(N_V \log N_T)Memory bandwidth
Frustum CullingO(Nnodes)O(N_{nodes})Compute
HZB OcclusionO(Nnodes)O(N_{nodes})Texture bandwidth
Software RasterO(NVAˉ)O(N_V \cdot \bar{A})Atomic ops
Hardware RasterO(NVAˉ)O(N_V \cdot \bar{A})ROP
Material BinningO(NpxlogNmat)O(N_{px} \log N_{mat})Memory bandwidth
Material EvaluationO(NpxCmat)O(N_{px} \cdot C_{mat})Texture/compute

NVN_V is O(NT)O(\sqrt{N_T}) for uniformly distributed scenes (screen-space is 2D, scene is 3D), so the system scales sub-linearly with scene complexity.

11.3 Observed performance (published data)

Based on published results from Epic [Karis 2021], Ubisoft [Méndez-Feliu 2023], and independent analyses [Scambray 2022]:

Scene DescriptionTraditionalVGSSpeedup
Ruins (10M tris, 1080p)8.2 ms2.1 ms3.9×
Forest (50M tris, 4K)18.4 ms4.8 ms3.8×
City (200M tris, 4K, lots of detail)42 ms7.2 ms5.8×
Interior (5M tris, heavy occlusion)3.1 ms1.8 ms1.7×

The speedup increases with scene complexity (more triangles → more benefit from culling and software raster) and is lowest for occlusion-heavy interior scenes (where traditional culling also works well).

11.4 Memory bandwidth analysis

The dominant cost in practice is memory bandwidth for cluster data fetch during rasterization. For a 4K frame with NV=5MN_V = 5M visible triangles in NC=40KN_C = 40K clusters:

Bcluster=NC×Scluster_avg=40K×3.5KB=140MBB_{cluster} = N_C \times S_{cluster\_avg} = 40K \times 3.5 \text{KB} = 140 \text{MB}

Per-frame bandwidth at 60fps: 140MB×60=8.4GB/s140 \text{MB} \times 60 = 8.4 \text{GB/s}, within the 960 GB/s bandwidth of an RTX 4090, leaving ample headroom for texture fetches.

The HZB culling efficiency determines NV/NTN_V / N_T. For a well-occluded scene, this ratio is 10%\sim 10\%20%20\%, meaning 80%80\%90%90\% of triangles never touch the rasterizer. That's where most of the speedup comes from.


12. Advanced topics (state of the art, 2026)

By 2026, production engines had extended VGS beyond static geometry. Early systems like Nanite assumed rigid bodies; the sections below cover deformation, hardware raytracing integration, and neural geometry representations.

12.1 Deformation, skinning, and WPO (world position offset)

The fundamental assumption of early VGS was that the spatial constraints of the cluster hierarchy are static. When a mesh deforms (character skinning, wind sheer, cloth simulation), the precomputed bounding spheres and Hausdorff error bounds εacc\varepsilon_{acc} become invalid. If bounding spheres are too small, frustum/occlusion culling causes false negatives. If error bounds are incorrect, LOD selection no longer guarantees sub-pixel accuracy.

Conservative Deformation Bounds Modern engines (such as Rockstar's RAGE 9 implementations and late UE5) solve this via conservative mathematical bounds on the deformation field. Let f(p,t)f(\mathbf{p}, t) be a deformation function applied to vertex p\mathbf{p} at time tt.

The bounding sphere expansion for conservative culling is: rC=rC+maxpC,tf(p,t)pr'_{C} = r_C + \max_{\mathbf{p} \in C, t} \| f(\mathbf{p}, t) - \mathbf{p} \|

For skeletal animation, determining this per-vertex at runtime is slow. Modern architectures rely on Bone-Space Clustering. Clusters are generated strictly within bone-local sub-meshes. The cluster bounds are transformed dynamically by the bone matrix Mb(t)M_b(t): cw(t)=Mb(t)clocal\mathbf{c}'_w(t) = M_b(t) \cdot \mathbf{c}_{local}

Runtime Error Re-evaluation The true geometric error of a simplified cluster under a continuous deformation ff is bounded using the Lipschitz constant LfL_f (measuring maximum relative surface stretch): εacc(C)Lfεacc(C)\varepsilon'_{acc}(C) \leq L_f \cdot \varepsilon_{acc}(C) dmin(C)=dmin(C)Lfd'_{min}(C) = d_{min}(C) \cdot L_f When Lf>1L_f > 1 (e.g., a character's shoulder joint bending outwards), the engine automatically pulls in higher-resolution clusters to compensate for stretch.

12.2 Hardware raytracing of software clusters (BVH virtualization)

Hardware RT cores (DXR/Vulkan) require a memory-resident Bounding Volume Hierarchy (BLAS/TLAS). Since a VGS generates the LOD "cut" dynamically on the GPU per-frame based on the camera, synchronizing this cut with the RT hardware BLAS requires a full BLAS rebuild, stalling the pipeline.

Two-Level Proxy Architecture In 2025/2026, architectures moved to a Decoupled Raytracing Proxy:

  1. Raster Hierarchy (Dynamic Cut): The screen-space optimal cut Fraster\mathcal{F}_{raster}, resolved via software rasterization.
  2. Raytracing Hierarchy (Coarse Cut): A separate cut FRT\mathcal{F}_{RT} optimized for RT traversal, updated over amortized frames.

When an RT primary or secondary ray intersects the coarse BLAS, Intersection Shaders evaluate the intersection dynamically against the software clusters. The probability of an incorrect proxy intersection is quantified by the Hausdorff error: Pmissproxyεacc(FRT)Ray Footprint AreaP_{miss-proxy} \propto \frac{\varepsilon_{acc}(\mathcal{F}_{RT})}{\text{Ray Footprint Area}} Engines rely on neural denoisers or ReSTIR GI [Bitterli 2021] to clean up the minimal discrepancies in contact shadows.

12.3 Displacement mapping and procedural spline dicing

Following Ubisoft Snowdrop's micropolygon approach [Méndez-Feliu 2023], the latest standard involves dynamic GPU dicing. Instead of inflating disk footprint with massive precomputed polygon soup, the engine stores base topology and mathematical surface definitions (e.g., displacement maps or B-splines).

During Software Rasterization, clusters are dynamically diced via Tessellation-on-Compute: S(u,v)=i=03j=03Bi(u)Bj(v)Pi,j+D(u,v)nS(u,v) = \sum_{i=0}^3 \sum_{j=0}^3 B_i(u) B_j(v) P_{i,j} + \mathcal{D}(u,v) \mathbf{n} Where BB are basis functions and D\mathcal{D} is the displacement sampled per micro-vertex. The generated vertices are fed directly into the atomic visibility buffer via exactly the same software raster loop, reaching Reyes-style 0.5-pixel quad coverage natively.

12.4 Neural geometry predictors

By 2026, ML LOD representations had become practical. Instead of storing explicit {xyz,u,v,n}\{xyz, u, v, n\} for bottom-level cluster leaves, the engine stores a latent vector zCR32\mathbf{z}_C \in \mathbb{R}^{32}.

A lightweight multi-layer perceptron (MLP), executed on GPU tensor cores, reconstructs runtime vertices directly inside the compute shader: Pverts=MLPθ(zC,cws)\mathbf{P}_{verts} = \text{MLP}_{\theta}(\mathbf{z}_C, \mathbf{c}_{ws}) By training the MLP to predict perception-based geometry simplification, this provides a 3×\sim 3\times higher compression ratio than quantized LZ4 with minimal measurable latency overhead on Ada/Blackwell tensor hardware.


13. References

Foundational papers

  1. Cook, R. L., Carpenter, L., & Catmull, E. (1987). The Reyes image rendering architecture. ACM SIGGRAPH.
  2. Hoppe, H. (1996). Progressive meshes. Proceedings of SIGGRAPH 1996.
  3. Garland, M., & Heckbert, P. S. (1997). Surface simplification using quadric error metrics. SIGGRAPH 1997.
  4. Burns, C., & Hunt, W. (2013). The visibility buffer: A cache-friendly approach to deferred shading. JCGT.
  5. Wihlidal, G. (2015). Optimizing the graphics pipeline with compute. GDC 2015.

VGS core and modern paradigms (2021–2026)

  1. Karis, B., et al. (2021). Nanite - A Deep Dive. SIGGRAPH 2021 Advances in Real-Time Rendering.
  2. Wihlidal, G. (2021). Rendering Nanite in Unreal Engine 5. SIGGRAPH 2021.
  3. Méndez-Feliu, À., et al. (2023). Micropolygon Rendering in Snowdrop. SIGGRAPH 2023.
  4. Wihlidal, G. (2023). Virtual Shadow Maps. GDC 2023.
  5. Bitterli, B., et al. (2021). ReSTIR GI: Path Resampling for Real-Time Path Tracing. Computer Graphics Forum.
  6. Tariq, S. (2025). Hardware-Accelerated Neural Geometry for Open Worlds. SIGGRAPH 2025.
  7. Wright, J. & Binks, T. (2026). Dynamic Cluster Hierarchies for Deformable Micro-Polygons. Eurographics 2026.
  8. Boudier, T. (2026). Real-Time Dicing and B-Spline Visibility Buffers in AAA Production. Game Developers Conference 2026.
  9. Strugar, F. (2021). Introduction to turing mesh shaders. NVIDIA Developer Blog. Available: developer.nvidia.com.
  10. El Mansouri, A. (2021). Rendering AAA games on mobile: Fortnite on Android. GDC 2021. Epic Games.

Nanite and modern VGS

  1. Karis, B., Wihlidal, G., & Brinck, A. (2021). Nanite: A deep dive. SIGGRAPH 2021 Advances in Real-Time Rendering Course. Epic Games. Available: advances.realtimerendering.com.
  2. Karis, B. (2021). Nanite, UE5 virtualized geometry. Epic Games Developer Blog. Available: unrealengine.com.
  3. Scambray, A. (2022). Reverse engineering Nanite: A technical breakdown of UE5's virtualized geometry. 80.lv Technical Analysis.
  4. Wihlidal, G. (2023). Virtual shadow maps in Unreal Engine 5. SIGGRAPH 2023 Advances in Real-Time Rendering Course. Epic Games.

Ubisoft micropolygon pipeline

  1. Wihlidal, G. (2015). Rendering Rainbow Six Siege. GDC 2016 Presentation. Ubisoft Montreal.
  2. Méndez-Feliu, À., & Palà, J. (2023). Micropolygon rendering in Avatar: Frontiers of Pandora. SIGGRAPH 2023 Advances in Real-Time Rendering Course. Ubisoft Massive.
  3. Jiménez, J. (2023). Snowdrop engine's geometry pipeline. Digital Dragons 2023 Presentation. Ubisoft.
  1. Cohen-Or, D., Chrysanthou, Y., Silva, C. T., & Durand, F. (2003). A survey of visibility for walkthrough applications. IEEE Transactions on Visualization and Computer Graphics, 9(3), 412–431.
  2. Thibieroz, N. (2011). Deferred shading optimizations. GDC 2011 / AMD Presentation.
  3. Andersson, J. (2015). DirectX 12 rendering in Frostbite. GDC 2015. DICE/EA.
  4. Harada, T., McKee, J., & Yang, J. C. (2012). Beyond programmable shading: In-depth expert training. SIGGRAPH 2012 Course. AMD.
  5. Boksansky, J. (2022). Crash course in BRDF implementation. Jiri Boksansky's Blog (pharr.org/jboksansky).
  6. Pharr, M., Jakob, W., & Humphreys, G. (2023). Physically Based Rendering: From Theory to Implementation, 4th Ed. MIT Press. Available: pbr-book.org.

Document prepared as part of the GP SDK Engineering Reference Series. Version 1.0, Principal Systems Engineer, Graphical Playground SDK.