GPU3D Log

17/01/2007 - Another fix and update in the ATTILA packages

This time it's a fix for Prey. A bug in the ATTILA OpenGL library was creating an infinite loop when translating some shader code from Prey. The source code has not changed but I have updated the VS2005 libs and the gl2attila builds in the binary package.

ATILA-rei source code (17-01-2007)
ATILA-rei x86 binaries (17-01-2007)

13/01/2007 - A minor update to the ATTILA packages

This update fixes a bug in GLInterceptor when capturing buffers passed as vertex arrays. This bug caused traces from Humus' Volumetric Lighting II demo to be displayed without textures (the array with the texture coordinates was captured incorrectly).

It also enables 'deferred buffers' in the ATILA OpenGL library (gllib or gl2attila) so the panic with UT2004 shouldn't be happening.

I have included a file in the doc/ directory of the source package with the OpenGL API calls that are implemented in the ATTILA OpenGL library. Take into account, however, that those calls may have bugs or may not support all the options and features specified in the OpenGL API. For example our implementation of glInterleavedArrays() only supports the GL_C4F_N3F_V3F format.
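To make the GL_C4F_N3F_V3F restriction concrete, here is a minimal sketch of how that interleaved layout is arranged in memory: per vertex, 4 floats of color, 3 floats of normal and 3 floats of position. The function name and the little-endian byte order are assumptions made for the illustration; this is not code from the ATTILA OpenGL library.

```python
import struct

# GL_C4F_N3F_V3F: 4 color floats (RGBA), 3 normal floats, 3 position floats.
# Little-endian 32-bit floats are assumed here for illustration.
VERTEX_FORMAT = "<4f3f3f"
VERTEX_SIZE = struct.calcsize(VERTEX_FORMAT)  # 10 floats = 40 bytes


def unpack_c4f_n3f_v3f(buffer, stride=0):
    """Split an interleaved buffer into (color, normal, position) tuples.

    As with glInterleavedArrays, a stride of 0 means tightly packed vertices.
    """
    step = stride if stride else VERTEX_SIZE
    vertices = []
    for offset in range(0, len(buffer) - VERTEX_SIZE + 1, step):
        values = struct.unpack_from(VERTEX_FORMAT, buffer, offset)
        vertices.append((values[0:4], values[4:7], values[7:10]))
    return vertices
```

An application that interleaves its arrays in any other order would need a different format string, which is the kind of case the library does not handle.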

ATILA-rei source code (13-01-2007)
ATILA-rei x86 binaries (13-01-2007)

29/12/2006 - Updated ATTILA GPU Simulator binary and source packages

I have updated the simulator binary and source file packages with a small change that solves the problems the previous release had when compiling and using the simulator on Linux machines. First, the Linux binaries are now statically linked to avoid conflicts with the different library versions that may exist in the Linux distributions. Second, to allow compiling the simulator on Linux, the part of the code that implemented the OpenGL library for ATTILA (linked as the precompiled gllib.a library) has become a stand-alone binary.

Now when you want to simulate an OpenGL trace captured with GLInterceptor you must first use the gl2attila tool to parse this OpenGL trace file and generate an AGP transaction trace file (AGP transactions are the objects that the implementation of the OpenGL library issues to the simulated ATTILA GPU to control the rendering process). The output file generated by this tool (named 'attila.tracefile.gz') is then fed as input to the simulator binary (bGPU or bGPU-Uni). The gl2attila tool can be found in the binary package.

To reduce the size of the AGP trace file I have included support for compressed files using ZLIB. Most Linux distributions, and Cygwin too, have ZLIB preinstalled so that shouldn't be a problem, but if yours doesn't you will have to obtain it and change the makefiles to point to the directory where it is stored. For VS2005/MSC 8.0 I have included a precompiled version of the ZLIB library in the win32/lib directory and the project files are configured to link the library when building the simulator binaries.

The problem linking with a precompiled OpenGL library for ATTILA (gllib.lib) was not present when compiling with VS2005/MSC 8.0 so I have decided that this build of the simulator will support both types of traces, the old OpenGL traces and the new AGP transaction traces. The simulator binaries check which kind of trace file is being passed as the input trace file and automatically select the proper Trace Driver mode.
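How the binaries tell the two trace kinds apart is not described here. One plausible heuristic, shown only as a sketch, is to look for the gzip magic bytes at the start of the file. It assumes that AGP transaction traces are always gzip-compressed and that OpenGL traces are plain text; both function names are hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def classify_header(header):
    """Return 'agp' for a gzip header, 'opengl' for anything else.

    Hypothetical heuristic: assumes AGP traces are always compressed and
    OpenGL traces never are.
    """
    return "agp" if header[:2] == GZIP_MAGIC else "opengl"


def guess_trace_kind(path):
    """Classify a trace file by reading only its first two bytes."""
    with open(path, "rb") as trace:
        return classify_header(trace.read(2))
```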

As a bonus in the doc/ directory in the source code package you will find a couple of unfinished documents about the configuration file parameters and the general architecture and programming manual of ATTILA GPU.

ATILA-rei source code (29-12-2006)
ATILA-rei x86 binaries (29-12-2006)

11/10/2006 - ATTILA GPU Simulator mail list

I have created a Yahoo Group (or mail list) for ATTILA-rei. There you can ask questions or share your knowledge about the simulator. In the strange case there is activity in the list it may become the only reference source about the simulator due to the current lack of documentation (and anyone with free time to write the documentation).

So I ask you to use the list for questions related to the simulator rather than sending private mails to me or other people in the group.

ATTILA GPU Simulator Mail List

Post message: attilasim@yahoogroups.com
Subscribe: attilasim-subscribe@yahoogroups.com
Unsubscribe: attilasim-unsubscribe@yahoogroups.com
List owner: attilasim-owner@yahoogroups.com

05/10/2006 - IISWC paper to be presented October 26th in San Jose, California

Jordi wrote a paper (with help from all the other members of course ;) about the GPU workload of some 'modern' games such as UT2004 and Doom3 (in fact some of the D3D9 games are really recent; what a pity that ATTILA doesn't implement the D3D9 API ... yet. BTW, Chema is our new guy who is working to solve this small problem). Prey, which is quite recent but still uses the Doom3 engine, was released just when we were finishing the paper so we couldn't use it even though it works in the simulator. And we haven't bothered yet to solve the problems with Chronicles of Riddick so we couldn't use it either. The paper was accepted at the IISWC symposium, which is held with ASPLOS. Everyone should be thankful, including me, that this time Jordi, and not me, will be presenting the paper ;).

Workload Characterization of 3D Games

We may even publish a longer more detailed technical report later if we find the time. That could include Prey or other new games.

05/10/2006 - Simulator source update

I had problems compiling the source code release under Visual C 2005 so I'm releasing a small update that seems to solve the problem ... at least for me. I have been wondering whether something useful could be done to improve compatibility under Linux, so that gllib.a doesn't suffer from incompatibilities when linking on different Linux distributions with different base C++ and C libraries, but I couldn't come up with any good idea. If anyone has a suggestion I will be pleased to try it.

The binaries worked, at least when I tested them ;), but just for 'coherence' I'm also updating the binary package with new versions. The old cygwin binary failed to work with my updated (last week) cygwin installation so I guess it will fail again in a couple of months when they change the cygwin dlls.

ATILA-rei source code
ATILA-rei x86 binaries

21/07/2006 - Simulator source released

So finally I'm releasing the source code for the simulator ... In this release only the source code for the simulator portion of the ATILA simulation framework is being released. The OpenGL tools and library for the ATILA GPU are released only as compiled binaries or libraries. If you need the source for the OpenGL you may ask Roger Espasa, Carlos Gonzalez and Jordi Roca who can claim to have implemented (or paid) for that part of the ATILA framework.

I have been quite busy the last months with my work as an 'overexploited underpaid university professor' (or at least that is what I'm lately complaining about and use as my main excuse for everything) and working on 'something else' that Roger, my PhD advisor, forced on me (at least he pays me for it ...). In any case that means that for half a year ... in fact almost a whole year as I did little in the autumn/winter semester because it was when I started as 'professor' (or whatever this form of exploitation is really called) and because of all the work I had to do to convert the first two papers we presented into something readable ... it has been almost impossible to work on the simulator. And I don't completely agree with the direction the project is taking (or the fact that I'm being forced by the circumstances to stop working on it). For those reasons, and because I decided from the start to release the source code when I finished with it, I'm releasing this source code now.

There is no documentation for now because of the 'excuses' explained above but I will try to work on that this summer and I will be writing documentation about the simulator configuration file, the statistics supported, what is implemented in the ATILA OpenGL library, the ATILA GPU programming manual and the implementation of the simulator (in that order, which means that if by September I have finished the first three it will be quite an unlikely success ...). I will be posting the documentation here. For now if you are interested you can read the ISPASS paper.

Here starts a brief introduction to the features of the released simulator:

ATILA Simulator:

- 'Modern' GPU features circa 2001 (it's actually below ATI R300 feature set)
- Unified shader architecture (at least this can be called 'modern' until next year)
- The whole hardware render pipeline has been implemented:

  1. Vertex Fetch stage
  2. Vertex (or unified) Shading stage
  3. Vertex Post-Shading Cache stage
  4. Primitive Assembly stage
  5. Clipper stage (only reject, not true triangle clipping)
  6. Triangle Setup stage (uses homogeneous rasterization algorithm, can be performed as a shader program)
  7. Fragment Generation stage (uses recursive rasterization algorithm)
  8. Hierarchical Z stage
  9. Fragment 'FIFO' stage (actually the code where the fragments are distributed to different stages)
  10. Z and Stencil test stage (early Z supported)
  11. Attribute Interpolation stage
  12. Fragment (or unified) Shading stage (implements per fragment fog and alpha test as a shader program)
  13. Color Write and Blend stage
  14. DAC (right now only used to output the rendered frame to a PPM file)
  15. Memory Controller
  16. Command Processor (doesn't work like a real command processor would ...)
- Texture compression (DXTC/S3TC)
- Z and color compression
- Anisotropic filtering (my own version of the ATI and NVidia angle-dependent algorithm)
- No full screen antialiasing algorithm has been implemented (which is a shame)
- Cycle-accurate (well, somewhat) execution-driven simulation model based on ASIM
- Large number of configuration options supported
- Large number of statistics gathered
- And the best of all: IT REALLY RENDERS THE FRAME
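As an illustration of what a reject-only clipper stage (item 5 above) amounts to, here is a minimal sketch of the standard trivial-reject test in homogeneous clip space. It uses the OpenGL convention -w <= x, y, z <= w and is not taken from the simulator source; the function name is hypothetical.

```python
def is_trivially_rejected(triangle):
    """True if all three vertices lie outside the same frustum plane.

    Each vertex is an (x, y, z, w) tuple in homogeneous clip space. A triangle
    that straddles a plane is kept whole, since no true clipping is done.
    """
    plane_tests = (
        lambda v: v[0] < -v[3],  # left
        lambda v: v[0] > v[3],   # right
        lambda v: v[1] < -v[3],  # bottom
        lambda v: v[1] > v[3],   # top
        lambda v: v[2] < -v[3],  # near
        lambda v: v[2] > v[3],   # far
    )
    return any(all(outside(v) for v in triangle) for outside in plane_tests)
```

Triangles that survive this test but still cross a frustum plane have to be handled later in the pipeline, for example by the rasterizer.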


ATILA OpenGL library:

- The ATILA OpenGL library implements a subset of the OpenGL API version 1.5 that is barely able to render the following games: Unreal Tournament 2004, Doom 3, Quake 4, Chronicles of Riddick and Prey. Nothing else should work unless miracles happen. More than a year ago we tried older games using vertex arrays rather than vertex buffers and they failed miserably.
- Automatic generation of vertex and fragment shader programs for the fixed function T&L and texturing legacy OpenGL API (that was implemented mainly for the small tests we were using at the start of the project and to support UT2004, so anything beyond may fail).
- ARB vertex and fragment program extensions supported.
- ARB vertex and index buffer objects extensions supported.

Links to the source code, binaries compiled for a number of x86 systems and a few OpenGL traces captured to use with the simulator or the player:

ATILA-rei source code
ATILA-rei x86 binaries
Small trace captured from a Doom3 timedemo
Trace captured from Humus Volumetric Lighting II demo
Pack with traces captured from a few small OpenGL applications

09/03/2006 - ISPASS paper

In two weeks I will be presenting a paper introducing the ATTILA GPU architecture, simulator and framework at the ISPASS 2006 Conference in Austin, Texas.

ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures

The paper describes (again) the GPU architecture we have implemented with the simulator without going into much detail, followed by a description (the new stuff) of the simulation methodology used in the simulator and details of the implementation (the C++ class hierarchy, the relation between the different classes and so on). The last part is a small experiment 'inspired' by the RV530 architecture that is used just as an excuse to present some of the statistics and graphics we can obtain with our simulator. The implementation of the 'fragment queue' execution mode of fragments in the shaders (or in other words execution as a single 'batch/thread' of fragments) had a big bug at the time so you can just ignore the results for it.

I have wanted to write something like a programming or reference manual of our simulator architecture with much more detailed information (for example the hardware registers used to program the GPU, the description of the signals between the simulator boxes and a brief description of the algorithms used in each box) but I have been way too busy lately. When (if ever) I get to do it I will put the document online here (in the improbable case that I have a lot of free time I may even write a proper page for the ATTILA simulator rather than using this lame blog). It may also signal that the time for other releases may be near ;).

The current state of the simulator is that we fully support (except for some glCopyTexImage related functionality that I would implement if I had a couple of free days) three main OpenGL games: UT2004, Doom 3 and Quake 4. Chronicles of Riddick is working with some graphic glitches (it's more or less in the same state that Doom 3 and Quake 4 were in at the time of the previous post). And other than maybe Serious Sam 2 or any new big release of an OpenGL game that may happen in the future (Quake Wars?) we aren't likely to actively work to support more OpenGL games.

On the other hand, as a middle term project we have already started to write the framework for Direct3D. But I don't expect anything 'big' working until the end of the current year.

21/11/2005 - Slides for the Micro-38 and HiPEAC presentations.

Those are the slides I used for the two papers I presented last week in the two consecutive Barcelona conferences. I wonder if someone really understood a word of what I said ...

Note: I think that the last time I really tried to speak in English was in a small presentation about binary translation in 2001-2002 and in that case I think nobody understood a word :).

Shader Performance Analysis on a Modern GPU Architecture

A Single (Unified) Shader GPU Microarchitecture for Embedded Systems

As the slides show (all the screenshots are generated by ATTILA except the one marked as rendered on a GeForce FX 5900) the simulator already supports UT2004, Doom 3 and Quake 4 (with some 'minor' visual differences that we are still hunting). Early next year it's likely we will be supporting a few more (Riddick, Serious Sam 2) that use some of the still unimplemented or bugged features in our OpenGL framework (state save, vertex arrays).

15/09/2005 - ATILA (surprisingly) accepted papers.

At least a half of the papers in the Micro-38 program seem to be already online so ...
And, on a second thought, even if the HiPEAC program isn't online I doubt there is any problem.
I would advise disregarding the titles and taking both papers as a way to introduce the simulator ...

Errata: In the Micro-38 paper it says that the DDR channels are 32-bit wide. Wrong. They are 64-bit ;). I didn't find that until too late :P.

Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos Gonzalez, Jordi Roca, Agustin Fernandez, Roger Espasa
To be presented at Micro-38, Barcelona, November 16th.

A Single (Unified) Shader GPU Microarchitecture for Embedded Systems
Victor Moya, Carlos Gonzalez, Jordi Roca, Agustin Fernandez, Roger Espasa
To be presented at HiPEAC, Barcelona, November 17-18th.

20/06/2005 - Simulator Demo.

I was finally able to compile a working version of the simulator for win32. The number of parameters in the configuration file has been reduced for simplicity, as most changes in the original file would just make the simulator either crash or not work properly (not enough parameter checking ;). The tools GLInterceptor and GLPlayer are also included, but don't expect them to work on most applications (that includes most OpenGL games) using vertex arrays.

The binary package is here.

A trace for Humus Volumetric Lighting II demo is here.

And a modified UT2004 trace from the PrimeVal map is here (removed unsupported NV_COMBINE4 and buffer delete calls).

The volumetric trace should work without problems but some frames in the UT2004 trace may not work or display rendering errors (try start frames 30, 60, 100 and 120).

The simulator was implemented to be as slow as possible ;) so expect multi-minute or multi-hour rendering times. Frames between 60 and 80 in the UT2004 trace are the fastest region, something like 10-20 minutes per frame on a P4 2.8 GHz and a 4vs 8fs configuration.

09/06/2005 - Future updates.

As a deadline was delayed I have to prepare an unexpected paper right now (and I wonder if that was luck or not ...). So I guess that any new update will have to be delayed at least a week. I should be writing some presentation pages soon though. As the simulator compiles on win32 I may add a demo binary or something, even though the OpenGL trace capturing tools and library are still quite limited and won't work for most complex applications. The traces are in text format so even without the trace capturing tools anyone can write silly test traces. In any case it won't be much more than an even slower and buggier version of a 'reference' rasterizer.

31/05/2005 - Anisotropic filtering.

I have been trying to implement anisotropic filtering in the simulator Texture Unit. As usual there isn't that much information about how current GPUs implement it. With a bit of help from the anisotropic filtering extension I was able to find an algorithm that may resemble the one used by ATI. If I have time in the next (months?) I will try to discover the algorithm that NVidia uses. Among the more complex (and likely more hardware expensive) approaches there is EWA, but I don't plan to implement something similar yet.
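For reference, the EXT_texture_filter_anisotropic specification suggests a way to derive the number of samples and the mipmap level from the texture coordinate derivatives. The sketch below follows that suggested computation; it is not the simulator's algorithm (which also has to decide where to place the samples), and the function name is hypothetical.

```python
import math


def anisotropic_footprint(dudx, dvdx, dudy, dvdy, max_aniso=16):
    """Return (sample_count, lod) from texel-space derivatives.

    Px and Py are the footprint lengths along the screen x and y axes. The
    sample count N is the ratio of the major to the minor axis, clamped to
    max_aniso, and the LOD is taken from the major axis divided by N.
    """
    px = math.hypot(dudx, dvdx)
    py = math.hypot(dudy, dvdy)
    p_max, p_min = max(px, py), min(px, py)
    if p_max == 0.0:
        return 1, float("-inf")  # degenerate footprint, nothing to filter
    ratio = p_max / p_min if p_min > 0.0 else float("inf")
    samples = max_aniso if ratio >= max_aniso else math.ceil(ratio)
    return samples, math.log2(p_max / samples)
```

With an 8:1 footprint this gives 8 samples taken from mipmap level 0, where plain trilinear filtering would have dropped to level 3.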

Some screenshots. The trace was generated from Xmas texture filtering test application. The ATILA screenshots were generated by the simulator and the R350 and NV35 screenshots were taken with our GLPlayer tool.

ATILA 16x AF

ATILA 16x AF colored mipmaps ATILA 16x AF

ATI R350 16x AF

R350 16x AF colored mipmaps R350 16x AF

NVidia NV35 8x AF

NV35 8x AF colored mipmaps NV35 8x AF

I guess that the next one should be multisampling support but I'm not yet sure on the compression algorithm. A 2x or 4x MSAA implementation without compression wouldn't be that hard to implement in the simulator.

13/05/2005 - Rejected Graphics Hardware 2005 paper

The paper I submitted to this year's Graphics Hardware can only be described as half a paper (agreeing with some of the reviews). If someone is interested in what a GPU pipeline looks like it can be interesting. Some of the references are really interesting too for those who don't know them. I will likely put up some of the results that were planned to go in that paper some time later. How does a unified architecture work for geometry limited cases? Well, for a two shader unified architecture you have the equivalent of 8 vertex shaders so you can guess. A public release? If you don't mind the lack of documentation, compatibility or any kind of support you can try to ask the right person nicely for it.

23/09/2004 - Even more test results

I continue spending my time trying to discover the texture cache architecture of the ATI Radeon 9800 (R350). I have calculated the texel displacements required to sample the texel positions I want with a 32x32:1 minification ratio. They are 17.01 for the horizontal axis and 17.064 for the vertical axis if I want to sample the first two texels for those axes. For different ratios they may change but I think they should be around multiples of those numbers. Don't ask me if those numbers make sense or what they mean. For the vertical axis the sampling point is still not correct for the whole screen quad in the lower left corner, lower octave of the screen and upper right corner following the dividing line between the quad's two triangles. But it is the best I could get. Someday I should try to work out the maths that could provoke such behaviour.

Using that I tried to test the 'worst' case of a quad requesting 16 texels, each from a different cache line (samples from the corner texels of the cache lines). But the penalty is still the same at 1/4 for horizontal and 1/2 for vertical ratios and 1/4 for cross ratios. I'm still trying to figure out what the limitation is.

I also discovered that the actual cache line size is 8x8 x 4 bytes, or 256 bytes, the size of a DDR memory transaction. I will try to work around it. I already had a clue about it being 8 in the horizontal axis when horizontal ratio 4 had only a 1/2 penalty while 8 and further ratios had a 1/4 penalty.
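To see how quickly a quad spreads over cache lines with this line size, here is a small sketch that counts the distinct 8x8 lines read by one 2x2 fragment quad. It assumes integer texel coordinates, lines aligned to multiples of 8 and a 2x2 texel footprint per fragment for bilinear sampling; the function name is hypothetical and nothing here is measured on real hardware.

```python
LINE_WIDTH = 8       # texels per cache line row
LINE_HEIGHT = 8      # texel rows per cache line
BYTES_PER_TEXEL = 4
LINE_BYTES = LINE_WIDTH * LINE_HEIGHT * BYTES_PER_TEXEL  # 256 bytes


def lines_touched(origin_x, origin_y, step_x, step_y, bilinear=True):
    """Count distinct cache lines read by one 2x2 fragment quad.

    step_x and step_y are the texel distances between adjacent fragments,
    i.e. the horizontal and vertical texel to fragment ratios.
    """
    footprint = (0, 1) if bilinear else (0,)
    lines = set()
    for fy in range(2):
        for fx in range(2):
            base_x = origin_x + fx * step_x
            base_y = origin_y + fy * step_y
            for ty in footprint:
                for tx in footprint:
                    lines.add(((base_x + tx) // LINE_WIDTH,
                               (base_y + ty) // LINE_HEIGHT))
    return len(lines)
```

For example, lines_touched(7, 7, 16, 16) reproduces the 'worst' case of 16 texels each from a different line, while lines_touched(0, 0, 1, 1) stays inside a single line.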

21/09/2004 - More test results

The new results for the ATI 9800 (R350) using separate horizontal, vertical and cross fragment to texel ratios show that for the vertical ratios the fillrate is 80% (ratio 1:4) or 73% (ratios 1:8, 1:16, 1:32) of the maximum fillrate (experimental; the theoretical one should be 10% - 20% larger). For the horizontal ratios the results show 53% for 4:1 and 28% for 8:1, 16:1 and 32:1. The cross ratios are the ones we tested previously. Results are 42% for 4:4 and 33% for 8:8, 16:16 and 32:32. Those results are using NEAREST sampling (point sampling).

For bilinear sampling (LINEAR) the vertical result is 53% for all the ratios (1:4, 1:8, 1:16, 1:32). Horizontal results are 53% for 4:1, and 28% for 8:1, 16:1 and 32:1. The cross result is 28% for all ratios.

All lower ratios have no penalties.

21/09/2004 - More articles

The Hakura and Gupta article analyzing texture cache architecture makes the following points (also summarized in the Akeley and Hanrahan course):

- Blocking of texture data reduces misses.
- Tiled rasterization with blocked texture data reduces misses.
- Use Morton order (Z order) to interleave cache texels in 4 cache banks. This allows accessing the 4 texels of a bilinear sample without conflicts in a single cycle.
- Use same order with other blocking levels.
- 6D blocking level can be used to create superblocks of the size of the cache. This reduces the working set.
- For mipmapping use two separate caches (odd and even mipmap levels) or 2-way associative cache.
- A 16 to 32 KB cache allows a 95% hit rate and 5-10:1 reduction of bandwidth.
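The Morton order point in the list above can be sketched in a few lines. Interleaving the bits of the texel coordinates puts the four texels of any 2x2 footprint in four different banks when the bank is chosen from the two low bits. The function names and the choice of x for the even bit positions are assumptions made for the illustration.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in the even bit positions)."""
    index = 0
    for bit in range(bits):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index


def bank_of(x, y):
    """Bank number from the two low Morton bits, i.e. the parity of x and y."""
    return morton_index(x, y) & 3
```

Because a bilinear footprint always covers one even and one odd coordinate on each axis, its four texels map to banks 0 to 3 exactly once, wherever the footprint starts.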

The Kekoa and others paper about prefetching also shows numbers for the different buffer (FIFO) sizes and their effect on performance for a number of memory system configurations. The basic architecture is based on the previous article. They use a 16 KB cache divided in two direct-mapped caches for odd and even mipmap levels, which allows trilinear in a single cycle. Each cache uses four independent banks to allow single cycle bilinear.

20/09/2004 - Texture Cache implementation paper

The paper by a group of Koreans in the IEEE Journal of Solid State Circuits about the implementation of a two level parallel texture cache doesn't solve the problem of what happens when you are sampling from two different cache lines. The proposed architecture has an 8 MB L2 and 8 (or 8-way as they call it) 16 KB L1 caches. The cache lines are reconfigurable from a 4x4 block to an 8x8 or a 16x16 block (that is, it is basically 4 blocks of 4x4 texels that can be accessed as a single line). The caches and the pipeline are divided into two mirror sections for odd and even LOD mipmaps to allow single cycle trilinear filtering. The L2 to L1 bus is 256 bits wide and double pumped and they claim 75 GB/s bandwidth at 150 MHz (they use a 0.16 process). The eight parallel L1 caches and filtering pipelines allow processing eight pixels independently. The paper is interesting from the hardware implementation point of view as it describes the whole texture cache pipeline including filtering.
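A quick back-of-the-envelope check of the bandwidth claim, as a sketch. It assumes that each of the eight L1 caches has its own 256-bit double pumped bus, which is my reading and not something stated above; the exact accounting in the paper may differ.

```python
def bus_bandwidth(width_bits, transfers_per_clock, clock_hz):
    """Peak bandwidth of one bus in bytes per second."""
    return width_bits / 8 * transfers_per_clock * clock_hz


# 256 bits wide, double pumped, 150 MHz
PER_BUS = bus_bandwidth(256, 2, 150e6)   # 9.6e9 bytes/s
AGGREGATE = PER_BUS * 8                  # assumed: one bus per L1 cache
```

That gives 9.6 GB/s per bus and 76.8 GB/s in total, which is in the neighbourhood of the 75 GB/s quoted.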

20/09/2004 - Experimental results

After testing the ATI Radeon 9000 (home), ATI Radeon 9600, ATI Radeon 9800 and NVidia GeForceFX 5900 for their texture cache behaviour I'm still not sure what architecture they are implementing. The tests include one for determining the size of the texture cache (and levels) and another trying to discover the behaviour when different texel to pixel ratios and texel displacements are used. The texel ratio goes from 1:1 to 4:1 in both point and bilinear sampling mode (as we already know that trilinear always takes two cycles in current architectures; I think only the old R100 with two texture units in tandem supported true single cycle trilinear).

The results show that the R9000 seems to have a 2 KB texture cache and all kinds of tested accesses (1:1 to 4:1 with and without bilinear) are performed in a single cycle (or at least with single cycle throughput). The R9600 and the R9800 share the same architecture, likely duplicating the fragment quad processor architecture and maybe adding some additional hardware (Hierarchical Z Buffer, larger L2?). They both have 8 KB texture caches and some limitations in the texture access for the 2:1 ratio (only when the texture coordinates are displaced (1,1) texels, or any odd multiple of it). That reduces the texture cache output to around 60% to 50% of its maximum throughput. For ratio 4:1 the texture throughput is reduced to 30% - 25% (point sampling) and 25% - 20% (bilinear sampling). I'm still trying to figure out what kind of texture cache architecture would produce those limitations. The NVidia GeForceFX 5900 has a 4 KB texture cache and it doesn't seem to have any limitations.

ATI may have decided to duplicate the texture cache size (R3xx architecture), even at the cost of adding some kind of undetermined limitation (reduce the number of addressable cache lines per cycle?), because for the common case of using mipmapping the ratio is always around 1:1 and never reaches 1:2. For the R2xx and NV3x architectures the reduced size of the texture cache may allow performing any combination of the 16 texture fetches required for a quad bilinear sample.

14/09/2004 - More Texture caches

There are more points to take into account. To support multitexture (multiple textures accessible from the same fragment) some cache architectures may be better than others. In the ideal scenario of a single texture used by all the fragments the tags and addresses used to access the cache are limited to a given memory region, and under ideal conditions (good rasterization order, good texture layout in memory) there shouldn't be many conflict misses. However as we move into multitexture (up to 16 in current hardware) conflict misses may start to appear if the texture caches are small. How do associativity and the number of active textures affect this? May using the active texture ID for addressing in the cache reduce conflict misses?

On a different topic I have read the NVidia patent about aggregation of texture requests and it is basically a 'clever' implementation with two levels of caches (color for formatted data, raw for unformatted data) that tries to reduce the address traffic by deriving the 4 texel addresses for each fragment from a single texel address. It also tries to group accesses to raw data inside the same line. The described implementation seems somewhat limited (and it is relatively old) and seems only to provide 4 lines of 16 texels for the color cache. I'm still not sure what the 4 RAMs in the color cache are as the description isn't very clear (after all it is a patent). But it seems to confirm my suspicion that each RAM corresponds to 1 texel color for each of the texels in a 2x2 quad so a bilinear access (or 4 in the described implementation) can be performed in just one cycle in all cases if there is no miss.

The NVidia patent on prefetching seems a bit different from what I thought it was. It presents a mechanism that provides 'hidden' cache lines and 2 indirections to the cache lines to limit the number of stalls caused by lines still being 'reserved' in the request FIFO. It also discusses other types of stalls and problems and how they propose to resolve them. I haven't read too much of it as it isn't my primary concern.

The ATI patent on compressed texture data in the cache doesn't seem to have any interesting information and I just skimmed over it.

13/09/2004 - Design of a Texture Unit

Objectives:

- 1 bilinear sample per cycle and fragment pipeline
- 1 trilinear sample each two cycles per fragment pipeline
- work unit is lockstep 2x2 fragment quad
- lod derivatives calculated from adjacent fragments in the quad

Requirements:

- 4 to 16 texel addresses must be calculated per cycle and quad
- 1 lod value must be calculated per cycle and one or two mipmap levels selected
- 16 texels must be read from the texture cache
- 4 bilinear samples must be calculated from 16 texels per cycle
- must support loopback for trilinear

Additional objectives:

- Support compressed textures.
- Support different texture formats.
- Support random access of unfiltered general data types.
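The per-quad work listed in the objectives and requirements above can be sketched as three small steps: derive the lod from the differences between fragments in the quad, select one or two mipmap levels, and blend 4 texels into a bilinear sample. This is only an illustrative sketch with hypothetical function names, not the design of the simulator's Texture Unit.

```python
import math


def quad_lod(quad_uv, width, height):
    """LOD from finite differences across a 2x2 fragment quad.

    quad_uv is [[top_left, top_right], [bottom_left, bottom_right]], each an
    (u, v) pair in normalized texture coordinates.
    """
    (u00, v00), (u10, v10) = quad_uv[0]
    (u01, v01), _ = quad_uv[1]
    dudx, dvdx = (u10 - u00) * width, (v10 - v00) * height
    dudy, dvdy = (u01 - u00) * width, (v01 - v00) * height
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    return math.log2(rho) if rho > 0.0 else float("-inf")


def mip_levels(lod, max_level):
    """Two adjacent levels plus the blend weight used by trilinear."""
    lod = min(max(lod, 0.0), float(max_level))
    lower = int(math.floor(lod))
    upper = min(lower + 1, max_level)
    return lower, upper, lod - lower


def bilinear(t00, t10, t01, t11, fu, fv):
    """Blend the 4 texels of a 2x2 footprint with fractional weights."""
    top = t00 + (t10 - t00) * fu
    bottom = t01 + (t11 - t01) * fu
    return top + (bottom - top) * fv
```

Trilinear would run bilinear once per selected level and blend the two results with the returned weight, which is where the loopback requirement comes from.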

13/09/2004 - Documents about Texture Unit and Cache Architecture

Apparatus and method for grouping texture cache requests

A 2003 patent, submitted in 2001, and based on another patent (seems to be exactly the same patent word by word and figure by figure) filed in 1998 by NVidia.

Describes a texture mechanism that is able to aggregate texture accesses from per fragment texture requests from the fragment pipeline. There are two separate caches: color cache and raw cache. The raw cache seems to store uncompressed and/or unformatted data read from memory. The color cache seems to store the compressed formatted data that is fed to the texture filtering unit. The texture cache hardware accepts up to 4 requests (2x2 fragment quad?) and accesses up to 16 texels. The texture data is organized in 4x4 blocks and 16x16 superblocks. The color cache is divided in 4 banks (or caches?) with 5 ports each (4 read, 1 write).
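A block and superblock organization like the one described can be sketched as an address calculation. The sketch assumes that a superblock is 16x16 texels (4x4 blocks of 4x4 texels) and that superblocks, blocks and texels are each stored in row-major order; the patent may order them differently, and the function name is hypothetical.

```python
BLOCK = 4        # texels per block side
SUPERBLOCK = 16  # texels per superblock side (assumed: 4x4 blocks)


def blocked_offset(x, y, width):
    """Texel offset for a superblock > block > texel layout.

    width is the texture width in texels and must be a multiple of SUPERBLOCK.
    """
    superblocks_per_row = width // SUPERBLOCK
    sb_index = (y // SUPERBLOCK) * superblocks_per_row + (x // SUPERBLOCK)
    block_x = (x % SUPERBLOCK) // BLOCK
    block_y = (y % SUPERBLOCK) // BLOCK
    block_index = block_y * (SUPERBLOCK // BLOCK) + block_x
    texel_index = (y % BLOCK) * BLOCK + (x % BLOCK)
    return (sb_index * SUPERBLOCK * SUPERBLOCK
            + block_index * BLOCK * BLOCK
            + texel_index)
```

With this layout the 16 texels of a 4x4 block are contiguous in memory, so a block can be fetched as one burst.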

Further reading is required for this document.

A Reconfigurable Multilevel Parallel Texture Cache Memory With 75-GB/S Parallel Cache Replacement Bandwidth

This paper from the May 2002 IEEE Journal of Solid State Circuits presents the hardware implementation of a two level (L1 and L2) texture cache that supports trilinear filtering.

Seems to be a good source of hardware related information and algorithms used.

Further reading is required for this document.

Cache memory for high latency and out-of-order return of texture data

Patent from Microsoft (bought?) filed in 1999.

Not read.

The Design and Analysis of a Cache Architecture for Texture Mapping

Paper from ISCA 1997 that analyzes the architecture of a texture cache and presents miss rates for different cache parameters: line size, block size, cache size, associativity or rasterization order.

Seems a good introduction to texture caches and presents some results that could be useful when we select our texture cache parameters.

Partially skimmed. More reading required when we choose the cache parameters.

Prefetching in a Texture Cache Architecture

Paper from Kekoa Proudfoot and others from Stanford University. Presents a prefetching architecture for texture caches (my ideas for the 'fetch' cache are based on or related to this architecture). The texture cache they present uses 6D blocking for the texture data in memory.

Key paper to read.

Multi-Level Texture Caching for 3D Graphics Hardware

Presents a multilevel texture cache architecture and presents results for L1 and L2 texture caches.

Not read.

Parallel Texture Caching

Paper from 1999 Eurographics written by Hanrahan and others from the Stanford University. Presents a parallel texture cache architecture. Presents 6D texture blocking and other basic concepts. Analyses how the number of texture units and parallelism affect the performance.

Not read.

Evaluation of High Performance Multicache Parallel Texture Mapping

From ICS 1998. Introduces basic concepts, analyses parallelism in 3D texture rendering.

Not read.

Method and Apparatus for a Compressed Texture Caching in a Video Graphic System

Patent from ATI, filed in 1999. Describes a texture unit architecture where the texture unit keeps compressed data that is only uncompressed before sending the sampled data to the filter unit using 4 decompression blocks.

Not read. May be interesting when we study implementation of texture compression.

Circuit and Method for Prefetching Data for a Texture Cache

NVidia patent filed in 2000. Seems similar to the fetch cache architecture where counters are used to reserve lines for pending requests. The figures show 4 cache banks to generate 4 texels per cycle.

Not read. Not clear if it has any interest.

Mail: Victor Moya Last modified: January 17th, 2007.