Monday, March 26, 2012

Details of my first result post

So I think it's time to explain the process that went into below image:

First Pass: "render" the transparent triangles. The word render is in quotation marks because nothing is actually being drawn to the screen in this step, only stored in a global array to be processed later. I use the keyword discard to discard any writes to the framebuffer.

Initially I made the mistake of backing the global data with a 1D texture. Bad Idea. I found out that my graphics card only supports texture sizes of 16384 which is not enough when potentially millions of fragments need to be stored - and yes even with the simple purple cube. For example, when the front face of the cube covers the entire screen 800x800 screen, you need to store 640,000 fragment. Then add the back face. Now its 1,280,000 fragments. So I switched to a texture buffer instead. These are a nice hybrid between buffer objects and textures where I can still use imageLoad and imageStore, but I have 134,217,728 texels of space.

As an implementation detail, my global data array is composed of uvec4's and packs data accordingly:

uvec4 globalData;
globalData.x = prevHeadIndex;
globalData.y = floatBitsToUint(depth);
globalData.z = packUnorm2x16(finalColor.xy);
globalData.w = packUnorm2x16(;

If anything, prevHeadIndex is the most deserving of its 32 bits because of how large the global data array is (remember, millions.. a short would not cut it)

Second Pass: resolve the transparency data. At this point there is a filled global array and a 2D screen-sized array of linked list heads. The goal here is for each pixel on the screen to get its linked list head, and then backtrack through the global data to form a list of all the fragments that had been rasterized to this pixel, i.e. all the transparent fragments of this pixel. Then once you have a list of fragments that occupy this pixel cell, you sort (I used a bubble sort, but I will try others) from farthest depth to closest depth. Then loop over the sorted array and blend each fragment with the ones before it to produce a final color.

So if the screen is 800x800, there are 640,000 pixels that need to resolve their transparency (well, in the naive approach at least). This is done by rendering two triangles that cover the entire screen. Because they cover the entire screen, one fragment is rasterized per pixel - exactly enough. This is just a means to an end of running the transparency resolver shader program on every pixel.

Yes, the approach is a bit hacky, but it's relatively simple when these triangles are defined straight in clip space coordinates, making the Vertex Shader purely passthrough:

#version 420 core
layout(location = 0) in vec4 position;
void main()
gl_Position = position;
Known bugs: the transparent shapes are not blending with the background and always appear above the background. I figure these are pretty simple fixes.

First Results!

It's 7:39 in the morning... and no, I did not wake up early. But I finally have results for order independent transparency, even if its just a simple purple cube.

I will sum up what I did once I am less tired.

Thursday, March 22, 2012

Background Info

After reading two good papers on this subject (previously mentioned OpenGL Insights chapter and Order-Independent Transparency in DirectX11), I am here to summarize what I learned.

The core idea behind order independent transparency is that we need to store all the transparent fragments somewhere. ALL the transparent fragments. This means turning off the depth test to allow every fragment to be processed. With all this data, the first approach is to keep a per-pixel linked list. The second approach is to pack the data sequentially and store offset pointers for each pixel.

I decided that I would implement the linked list approach first, as it seems like a nicer algorithm. Essentially each pixel cell needs to store a pointer to the head of its linked list. This indexes into a global memory array which stores all the fragments that have been processed. Besides storing color and depth, each member of the global memory array also stores an index to the next node in the linked list.

By the end of processing, each pixel on the screen has a corresponding linked list. I drew this diagram to explain how the algorithm works. It requires an atomic counter for node_count (to avoid a slew of memory conflict issues), a 2D uint texture for heads, a 1D vec4 buffer texture for color data, a 1D uint buffer texture for next. The missing value is depth, which may need its own similar texture. The image load, store, and exchange functions of OpenGL 4 are suitable for writing/reading/exchanging data from these textures.

When this procedure is over, each fragment loops over its linked list, sorts the fragments in back to front order, and blends them with the current pixel color (which may contain colors from the opaque objects in the scene).

As for sorting, I plan on using a bubble sort. This is a very easy algorithm to write for a fragment shader and is suitable for the smallish input size (maybe maximum of 30 overlapping transparent fragments?).

There are tons of optimization details that I haven't mentioned, and probably won't implement until I get the basic thing working.

I'm hoping that I'll have something more to show by tomorrow.

Monday, March 12, 2012

Single Pass Order Independent Transparency - the beginning

Transparency is an important visual technique in real time rendering - think water simulations or tinted car windows in games. Unfortunately, one of the main limitations of the Z-buffer is its ability to handle transparency. The Z-buffer stores just the closest fragment for each pixel cell. This approach works fine for opaque scenes as there is no need to remember farther away fragments if we cannot see through the closest fragment. But when transparent objects are present in the scene, the final color of the pixel is dependent on all the transparent fragments. The Z-buffer does not give us the storage space for that.

There are several approaches to cutting around the transparency problem. One method is to sort the transparent surfaces from back to front and render after the opaque object pass. This can be both slow and hazardous. Another method, called depth peeling, avoids the sorting problem but requires multiple rendering passes. If there are many transparent layers, this can become quite slow.

With advances in DirectX 11 and OpenGL 4, single-pass OIT is now possible. ATI has presented a solution involving atomic operations and append buffers where each pixel stores a linked list of its transparent fragments. The image below is a screenshot from a real time demo using Direct X 11.

I hope to implement single-pass order independent transparency in OpenGL and C++, specifically using the atomic operations introduced in OpenGL 4.2. One of my key resources will be the first chapter of Patrick's book OpenGL Insights. The chapter is called Efficient Layered Fragment Buffer Techniques and is written by Pyarelal Knowles, Geo Leach, and Fabio Zambetta. One nice thing is that they tested their implementation on a GTX 460 which is the same card I have.

Their explanation is concise and detailed. While their primary target is transparency, their approach is general enough that it may be applied to other rendering techniques involving multi-fragment storage per pixel. I feel like there is a LOT on the horizon for atomic buffer reading and writing.