Non-interleaved Deferred Shading of Interleaved Sample Patterns

June 25, 2008 at 11:19 pm (Papers)

Continuing with the theme of my last substantive post (it’s been awhile, but it was Keepin’ It Low-Res), here’s a paper from Graphics Hardware 2006 that deals with computing lighting at a resolution lower than the screen resolution. The exact application is in this case is Instant Radiosity style global illumination. As a refresh, instant radiosity approximates bounce lighting by tracing rays from the light sources and placing point lights where the ray intersects the scene to approximate reflected light. The radiance equation can then be approximated with a Monte Carlo integration of radiance computed by evaluating a set of point lights.

With a deferred rendering pipeline, you can reduce some of the shader bottlenecks you’d encounter evaluating hundreds of these placed point lights and shadow map comparisons by rendering proxies for each light source so that shading is computed only at relevant pixels. However, it would be even better if only had to evaluate a subset of those lights per-pixel. Obviously, that is the goal of this paper. The devil is in the details though, as performing incoherent lighting calculations (as you may do in a naive implementation) doesn’t jive so well on graphics hardware where coherency is key.

Let’s first talk about the naive implementation mentioned in the last paragraph. How might you first go about selectively shading different pixels with a different set of lights? You could start out by using dynamic flow control. Something like:

if ScreenSpacePos.x % 2 == 0 && ScreenSpacePos.y % 2 == 0

    for( int i = 0; i < nLightsSet1; i++ )
        // Compute shading for light i, in set 1
else if ScreenSpacePos.x % 2 == 1 && ScreenSpacePos.y % 2 == 0
    for( int i = 0; i < nLightsSet2; i++ )
       // Compute shading for light i, in set 2
else if ... // The other two cases

This is obviously bad for coherency. Within a group of pixels assigned to one SIMD, you are guaranteed to follow all four paths and essentially all four paths will be executed for each pixel in that SIMD group. So you will be doing 4x more work per-pixel than if all pixels in that group were from VPL set 1. You could similarly try and use a stencil buffer to mask all pixels for a given light set, but given the alternating light set pattern the stencil mask is incoherent and you won’t get any of the benefits of Hi-Stencil. The same thing goes for using depth to mask pixels (Hi-Z is trashed). I’m pretty sure in the last few generations of hardware that there isn’t a performance difference between using stencil or depth to cull computation, FYI.

I think you’re getting the idea. All of the GBuffer texels for a given light set should be computed coherently, so ideally all of the aforementioned GBuffer texels should be organized next to each other in the GBuffer. As an aside, it is important to understand that textures are block allocated on the GPU. They are allocated in 2D blocks because texture fetches are spatially coherent in 2D. When you fetch one texel, you’re typically going to fetch another one in the 2D neighborhood of the last fetch. Therefore when I say that GBuffer texels for a given light set should be next to each other in the Gbuffer I mean in blocks as depicted in Figure 1:

Figure 1 Left: Interleaved texels Center: Correctly coalesced texels Right: Non-cache friendly coalescence

In order to reorganize the GBuffer texels in such a fashion, the authors first suggest a one-pass approach. Say we have four VPL light sets. While we want them uniformly distributed across the viewport in a regular fashion so that we can interpolate the shaded values at the end, we want them organized in blocks to be coherent when shading. Texel (x,y) in the shuffled GBuffer should be mapped to texel ( (x % 2)*2 + x/2, (y%2)*2 + y/2) in the initial GBuffer. This will work fine, but the fetching of the initial GBuffer texels is incoherent (word count: 23) because they’re being fetched from all over the GBuffer instead of in a smaller neighborhood.

The authors suggest a two-pass approach instead. Rather than jumping straight to the shuffling described in the last paragraph, the shuffling is performed in cache-friendly blocks.

The shading is then performed on the reorganized GBuffer and then a de-swizzle or “gathering” step is performed to undo the mapping that was performed in previous steps. When the shaded pixels are de-swizzled, each pixel only has the shading for one subset of the overall set of lights in the scene. The authors discuss some clever little tricks to quickly filter these shading values in a discontinuity-respecting manner so that the computed irradiance is smooth across the image. I’ll leave that for you to parse from the paper.

Non-interleaved Deferred Shading of Interleaved Sample Patterns

Benjamin Segovia, Jean-Claude Iehl , Richard Mitanchey and Bernard Péroche Proceedings of SIGGRAPH/Eurographics Workshop on Graphics Hardware 2006


  1. Real-Time Rendering » Blog Archive » Exploiting temporal and spatial coherence said,

    […] A different approach is discussed in Geometry-Aware Framebuffer Level of Detail (EGSR 2008, available here). Here the idea is to render certain quantities into a lower resolution frame buffer, using a joint bilateral filter to resample them during final rendering. As with the previous technique, care must be taken in selecting intermediate values; they should be both expensive to compute and vary slowly over the screen. This powerful acceleration technique was also used in the paper Image-Based Proxy Accumulation for Real-Time Soft Global Illumination (PG 2007, available here). Variations of this technique have been used in games, with perhaps the most common case being particle rendering, as the Level of Detail blog points out in this interesting post on the subject. The same blog also has insightful posts on many of the papers mentioned here, as well as another related paper. […]

  2. Imperfect Shadow Maps « Level of Detail said,

    […] from “Non-interleaved Deferred Shading of Interleaved Sample Patterns” (talked about here) and only process a subset of the VPLs at each […]

  3. level of detail » Blog Archive » Imperfect Shadow Maps said,

    […] from “Non-interleaved Deferred Shading of Interleaved Sample Patterns” (talked about here) and only process a subset of the VPLs at each […]

  4. Imperfect Shadow MapsJeremy Shopf said,

    […] from “Non-interleaved Deferred Shading of Interleaved Sample Patterns” (talked about here) and only process a subset of the VPLs at each […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: