Game Coding Complete
Did you ever hear the joke about the programmer trying to beat the Devil in a coding contest? Part of his solution involved overcoming a memory limitation by storing a few bytes in a chain of soundwaves between the microphone and the speaker. That's an interesting idea, and I'll bet we would have tried that one on Ultima VII had someone on our team thought of it.
Memory comes in very different shapes, sizes, and speeds. If you know what you're doing you can write programs that make efficient use of these different memory blocks. If you believe that it doesn't matter how you use memory, you're in for a real shock. This includes assuming that the standard memory manager for your operating system is efficient; it usually isn't and you'll have to think about writing your own.
Understanding the Different Kinds of Memory
The system RAM is the main warehouse for storage as long as the lights are on. Video RAM or VRAM is much smaller and is specifically used for storing objects that will be used by the video card. On top of it all, virtual memory hides a hard disk behind your lightning fast system RAM, and if you're not careful a simple memcpy() could cause the hard drive to seek. You might as well pack up and go to lunch if this happens.
System RAM
Your system RAM is a series of memory sticks that are installed on the mother board. Memory is actually stored in nine bits per byte, with the extra bit used to catch memory parity errors. Depending on the OS, you get to play with a certain addressable range of memory. The operating system keeps some to itself. Of the parts you get to play with, it is divided into three parts when your application loads:
-
Global memory: This memory never changes size. It is allocated when your program loads and stores global variables, text strings, and virtual function tables.
-
Stack: This memory grows as your code calls deeper into core code and it shrinks as the code returns. The stack is used for parameters in function calls and local variables.
-
Heap: This memory grows and shrinks with dynamic memory allocation. It is used for persistent objects and dynamic data structures.
Old timers used to call global memory the DATA segment, harkening back to the days when there used to be near memory and far memory. It was called that because programmers used different pointers to get to it. What a disgusting practice! How I miss 16-bit segmented memory architectures. Not! Everything is much cleaner these days because every pointer is a full 32 bits. (Don't worry, I'm not going to bore you with the "When I went to school I used to load programs from a linear access tape cassette" story.)
Your compiler and linker will attempt to optimize the location of anything you put into the global memory space based on the type of variable. This includes constant text strings. Many compilers, including Visual Studio, will attempt to store text strings only once to save space:
const char *error1 = "Error"; const char *error2 = "Error"; int main() { printf ("%x\n", (int)error1); // How quaint. A printf. printf ("%x\n", (int)error2); return 0; }
This code yields interesting results. You'll notice that under Visual C++, the two pointers point to the same text string in the global address space. Even better than that, the text string is one that was already global and stuck in the CRT libraries. It's as if we wasted our time typing "Error." This trick only works for constant text strings, since the compiler knows they can never change. Everything else gets its own space. If you want the compiler to consolidate equivalent text strings, they must be constant text strings.
Don't make the mistake of counting on some kind of rational order to the global addresses. You can't count on anything the compiler or linker will do, especially if you are considering crossing platforms.
On most operating systems, the stack starts at high addresses and grows towards lower addresses. C and C++ parameters get pushed onto the stack from right to left—the last parameter is the first to get pushed onto the stack in a function call. Local parameters get pushed onto the stack in their order of appearance:
void testStack(int x, int y) { int a = 1; int b = 2; printf("&x= %-10x &y= %-10x\n", &x, &y); printf("&a= %-10x &b= %-10x\n", &a, &b); }
This code produces the following output:
&x= 12fdf0 &y= 12fdf4 &a= 12fde0 &b= 12fdd4
Stack addresses grow downward to smaller memory addresses. Thus it should be clear that the order in which the parameters and local variables were pushed was: y, x, a, and b. The next time you're debugging some assembler code you'll be glad to understand this, especially if you are setting your instruction pointer by hand.
C++ allows a high degree of control over the local scope. Every time you enclose code in a set of braces you open a local scope with its own local variables:
int main() { int a = 0; { // start a local scope here... int a = 1; printf("%d\n", a); } printf("%d\n", a); }
This code compiles and runs just fine. The two integer variables are completely separate entities. I've written this example to make a clear point, but I'd never actually write code like this. Opening up a local scope just to reuse a variable name is something they shoot programmers for down here in Texas. The real usefulness of this kind of code is for use with C++ objects that perform useful tasks when they are destroyed—you can control the exact moment a destructor is called by closing a local scope.
A singleton manager or global variable manager is a good example. Global variables are a pain in the butt, but are sometimes necessary evils. If you've ever programmed a game using a state based 3D renderer you use global variables every time to change the render state. The code tends to keep temporary variables around to store the old state of the global variables. After you are finished with the global variables, you'll use the value of the temporary variables to set them back to their old state. This kind of programming practice is rampant in Windows GDI code. Lucky for you there's a better way, using C++ destructors to cleanly set and reset globals:
//------------------------------------- // singleton template manages setting/resettting global variables. //------------------------------------- template <class T> class singleton { T m_OldValue; T* m_pGlobalValue; public: singleton(T newValue, T* globalValue) { m_pGlobalValue = globalValue; m_OldValue = *globalValue; *m_pGlobalValue = newValue; } ~singleton() { *m_pGlobalValue = m_OldValue; } };
Instead of setting the global variable directly, use this template class. The constructor sets the value of the global after it saves the old value in a member variable. The destructor puts the global back to the original value.
Here's an example:
void main() { extern bool g_FavorVRAMSurfaces; // We'll use the default favored VRAM MakeSomeTextures(); { // Now put the textures in system memory singleton<bool> useSystemRAM(false, &g_FavorVRAMSurfaces); MakeSomeTextures(); } }
The example shows how the singleton template can be used in conjunction with the scoping rules of C++ to keep track of a global variable. The global, g_FavorVRAMSurfaces, is set to true whenever the fictional texture manager should favor allocating VRAM instead of system RAM. The code inside the main function shows how each call to MakeSomeTextures() can use the default value of the global or a specific value.
Video Memory (VRAM)
Video RAM is the memory installed on your video card, unless we're talking about an Xbox. Xbox hardware has unified memory architecture or UMI, so there's no difference between system RAM and VRAM. It would be nice if the rest of the world worked that way. Other hardware such as the Intel architectures must send any data between VRAM and system RAM over a bus. The PS2 has even more different kinds of memory. There are quite a few bus architectures and speeds out there and it is wise to understand how reading and writing data across the bus affects your game's speed.
As long as the CPU doesn't have to read from VRAM everything clicks along pretty fast. If you need to grab a piece of VRAM for something the bits have to be sent across the bus to system RAM. Depending on your architecture, your CPU and GPU must argue for a moment about timing, stream the bits, and go their separate ways. While this painful process is occurring, your game has come to a complete halt.
The hard disk can't write straight to VRAM, so every time a new texture is needed you'll need to stop the presses, so to speak. The smart approach is to load up as many new textures as you can, hopefully limiting any communication needed between the CPU and the video card.
Best Practice | Never perform per pixel operations on data stored in VRAM. If you can, keep a scratch copy in system RAM. If you must do something weird to VRAM, copy the whole thing into system RAM and perform the operation there. When you're done, copy it back to VRAM in one shot, hopefully using an asynchronous copy if your graphics library supports it. Under DirectX, the NO_WAIT flag is the ticket for an asynchronous Blt(). The exception to this rule: If your game can require the latest video cards, most per pixel operations can be programmed in pixel shaders. |
If you've been paying attention you'll realize that the GPU in your video card is simply painting the screen using the components in VRAM. If it ever has to stop and ask system RAM for something, your game won't run as fast as it could. Of course, if the CPU never sent anything different for the video card to draw, your game would be pretty boring, unless you like watching images that never change.
A Tale from the Pixel Mines | The first texture manager I ever wrote was for Ultima IX. That was before the game was called Ultima: Ascension. I wrote the texture manager for 3DFx's Glide API, and I had all of an hour to do it. We wanted to show some Origin execs what Ultima looked like running under hardware acceleration. Not being the programmer extraordinare, my algorithm had to be pretty simple. I chose a variant of LRU, but since I didn't have time to write the code to sort and organize the textures, I simply threw out every texture in VRAM the moment there wasn't any additional space. I think this code got some nomination for the dumbest texture manager ever written, but it actually worked. The Avatar would walk around for ninety seconds or so before the hard disk lit up and everything came to a halt for two seconds. I'm pretty sure someone rewrote it before U9 shipped. At least, I hope someone rewrote it! |
Optimizing Memory Access
Every access to system RAM uses a CPU cache. If the desired memory location is already in the cache, the contents of the memory location are presented to the CPU extremely quickly. If, on the other hand, the memory is not in the cache, a new block of system RAM must be fetched into the cache. This takes a lot longer than you'd think.
A good testbed for this problem uses multidimensional arrays. C++ defines its arrays in row major order. This ordering puts the members of the right most index next to each other in memory.
TestData[0][0][0] and TestData[0][0][l] are stored in adjacent memory locations.
Gotcha | Not every language defines arrays in row order. Some versions of PASCAL define arrays in column order. Don't make assumptions unless you like writing slow code. |
If you access an array in the wrong order, it will create a worst case CPU cache scenario. Here's an example of two functions that access the same array, and do the same task. One will run much faster than the other:
const int g_n = 250; float TestData[g_n][g_n][g_n]; inline void column_ordered() { for (int k=0; k<g_n; k++) for (int j=0; j<g_n; j++) for (int i=0; i<g_n; i++) TestData[i][j][k] = 0.0f; } inline void row_ordered() { for (int i=0; i<g_n; i++) for (int j=0; j<g_n; j++) for (int k=0; k<g_n; k++) TestData[i][j][k] = 0.0f; }
The timed output of running both functions on my test machine showed that accessing the array in row order was nearly nine times faster:
Column Ordered=2817 ms Row Ordered=298 ms Delta=2519 ms
Any code that accesses any largish data structure can benefit from this technique. If you had a multistep process that affected a large data set, try to arrange your code to perform as much work in smaller memory blocks. You'll optimize the use of the L2 cache and make a much faster piece of code.
Memory Alignment
The CPU reads and writes memory-aligned data much faster than other data. Any N-byte data type is memory aligned if the starting address is evenly divisible by N. For example, a 32-bit integer is memory aligned if the starting address is 0x04000000. The same 32-bit integer is unaligned if the starting address is 0x04000002, since the memory address is not evenly divisible by 4 bytes.
You can perform a little experiment in memory alignment and how it affects access time by using example code like this:
#pragma pack(push, 1) struct ReallySlowStruct { char c : 6; __int64 d : 64; int b : 32; char a : 8; }; struct SlowStruct { char c; __int64 d; int b; char a; }; struct FastStruct { __int64 d; int b; char a; char c; char unused[2]; }; #pragma pack(pop)
I wrote a piece of code to perform some operations on the member variables in each structure. The difference in times is as follows:
Really slow=417 ms Slow=222 ms Fast=192 ms
Your penalty for using the SlowStruct over FastStruct is about 14% on my test machine. The penalty for using ReallySlowStruct is code that runs twice as slow.
The first structure isn't even aligned properly on bit boundaries, hence the name ReallySlowStruct. The definition of the 6-bit char variable throws the entire structure out of alignment. The second structure, SlowStruct, is also out of alignment but at least the byte boundaries are aligned. The last structure, FastStruct, is completely aligned for each member. The last member, unused, ensures that the structure fills out to an eight-byte boundary in case someone declares an array of FastStruct.
Notice the #pragma pack(push, 1) at the top of the source example? It's accompanied by a #pragma pack(pop) at the bottom. Without them, the compiler, depending on your project settings, will choose to spread out the member variables and place each one on an optimal byte boundary. When the member variables are spread out like that the CPU can access each member quickly, but all that unused space can add up. If the compiler were left to optimize SlowStruct by adding unused bytes each structure would be 24 bytes instead of just 14. Seven extra bytes are padded after the first char variable, and the remaining bytes are added at the end. This ensures the entire structure always starts on an eight byte boundary. That's about 40% wasted space, all due to a careless ordering of member variables.
Don't let the compiler waste precious memory space. Put some of your brain cells to work and align your own member variables. You don't get many opportunities to save memory and optimize CPU at the same time.
Virtual Memory
Virtual memory increases the addressable memory space by caching unused memory blocks to the hard disk. The scheme depends on the fact that even though you might have a 500Mb data structure, you aren't going to be playing with the whole thing at the same time. The unused bits are saved off to your hard disk, until you need them again. You should be cheering and wincing at the same time. Cheering because every programmer likes having a big memory playground, and wincing because anything involving the hard disk wastes a lot of time and, last time I checked, PS2s and GameCubes didn't sport harddrives out of the box.
Just to see how bad it can get, I took the code from the array access example and modified it to iterate through a three dimensional array 500 elements cubed. The total size of the array would be 476Mb, much bigger than the installed memory on the test machine. A data structure bigger than available memory is sometimes called out-of-core. I ran the column_ordered() function and went to lunch. When I got back about 30 minutes later the test program was still chugging away. The hard drive was seeking like mad, and I began to wonder whether my hard disk would give out. I became impatient and re-ran the example and timed just one iteration of the inner loop. It took 379.75 seconds to run the inner loop. The entire thing would have taken over 50 hours to run. I'm glad I didn't wait.
Remember that the original array, 250 elements cubed, ran the test code in 298 ms when the fast row_ordered() function was used. The large array is only 8 times bigger, giving an expectation that the same code should have run in 2384 ms, or just under two and a half seconds.
Compare 2384 ms with 50 hours and you'll see how virtual memory can work against you if your code accesses virtual memory incorrectly.
Gotcha | Any time a cache is used inefficiently you can degrade the overall performance of your game by many orders of magnitude. This is commonly called "thrashing the cache" and is your worst nightmare. The solution to this problem might be as simple as reordering a few lines of code. It might also be as heinous as reducing your data set. |
Using Memory Mapped Files
A memory mapped file maps a virtual addressable memory space onto a file. The entire contents of the file can be read and written as easy as if it were entirely in memory. Win32 and Linux both support memory mapped files. Instead of writing your own low level memory cache, use memory mapped files to load any resources or data into your game. Since Windows uses memory mapping to load .EXE and .DLL files, it's a fair bet that the algorithm is fairly efficient. These files are especially useful for larger data sets that come in groups of 512Kb. Don't use these files for anything smaller. Check your system for the default size of a virtual memory page—that is the smallest chunk of memory a memory mapped file manipulates.
Under Win32, you can open a read-only memory mapped file with this code:
// // Open a memory mapped file // // Please forgive the lack of error checking code, // you should always check return values. HANDLE fileHandle = CreateFile( fileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, // the file must already exist 0, NULL); // // Create a mapping object from the file // HANDLE fileMapping = CreateFileMapping( fileHandle, NULL, PAGE_READONLY, 0, 0, NULL); // // Map the file to memory pointer // char *fileBase = (char *) MapViewOfFile( fileMapping, FILE_MAP_READ, 0, 0, 0);
Once you have a valid memory map, you can read from the file by accessing the variable, fileBase, as shown here.
// Read the first 50 bytes and copy them to dest. memcpy(dest, fileBase, 50);
The operating system doesn't copy the entire contents of the file into memory. Instead, it uses the same code used to manage virtual memory. The only thing that is different is the name of the file on secondary storage.
Gotcha | Don't memory-map a file if it exists on CD-ROM. If the CD is ejected, your memory map will become invalid. If the CD is ejected while data is being read, your application will suffer a crash. Sadly, the crash will happen before Windows has a chance to send the WM_DEVICECHANGE message, which is always sent when the CD is ejected. |
A Tale from the Pixel Mines | An alternative to MapViewOfFile, MapViewOfFileEx, has an additional parameter that allows a programmer to specify the exact base address of the memory map. Ultima IX's original resource cache made use of this parameter, but in a very bad way. I know because I wrote it. Each resource file assumed that it would exist at a specific base address. This allowed me to code data structures, including pointers, directly as data in the memory mapped file. This seemed like a great idea until I realized Ultima IX would no longer run on Windows NT. It turns out that the allowable base address range was completely different from Windows 95. I knew when I coded the thing that it was a dream too good to be true. I eliminated the base address specific assumptions from the code and data files and it worked just fine. |
Writing Your Own Memory Manager
Most games extend the provided memory management system. The biggest reasons to do this are performance, efficiency, and improved debugging. Default memory managers in the C runtime are designed to run fairly well in a wide range of memory allocation scenarios. They tend to break down under the load of computer games, where allocations and deallocations of relatively tiny memory blocks can be fast and furious.
Gotcha | You shouldn't purposely design your game to create a worst case scenario for the memory manager in the C runtime. I've known a few programmers who find it a fun and challenging task to write their own memory manager. I'm not one of them, and you should also resist the urge. The memory manager in the C runtime doesn't suck completely, and it has gone through more programming time and debugging than you'll have even if your game is in development for a number of years. Instead, you'll want to write a system that is either much simpler in design and features, or you will choose to simply extend the memory manager that is already there. |
A Tale from the Pixel Mines | Ultima VII: The Black Gate had a legendary memory manager: The VooDoo Memory Management System. It was written by a programmer that used to work on guided missile systems for the Department of Defense, a brilliant and dedicated engineer. U7 ran in good old DOS back in the days when protected mode was the neat new thing. VooDoo was a true 32-bit memory system for a 16-bit operating system, and the only problem with it was you had to read and write to the memory locations with assembly code, since the Borland compiler didn't understand 32-bit pointers. It was done this way because U7 couldn't really exist in a 16-bit memory space—there were atomic data structures larger than 64Kb. For all its hoopla VooDoo was actually pretty simple, and it only provided the most basic memory management features. The fact that it was actually called VooDoo was a testament to the fact that it actually worked; it wasn't exactly supported by the operating system or the Borland compilers. VooDoo MM for Ultima VII is a great example of writing a simple memory manager to solve a specific problem. It didn't support multithreading, it assumed that memory blocks were large, and finally it wasn't written to support a high number or frequency of allocations. |
A standard memory manager, like the one in the C runtime, must support multithreading. Each time the memory manager's data structures are accessed or changed they must be protected with critical sections, allowing only one thread to allocate or deallocate memory at a time. All this extra code is time consuming, especially if you malloc and free very frequently. Most games are multithreaded to support sound systems, but don't necessarily need a multithreaded memory manager for every part of the game. A single threaded memory manager that you write yourself might be a good solution.
Simple memory managers can use a doubly linked list as the basis for keeping track of allocated and free memory blocks. The C runtime uses a more complicated system to reduce the algorithmic complexity of searching through the allocated and free blocks that could be as small as a single byte. Your memory blocks might be either more regularly shaped, fewer in number, or both. This creates an opportunity to design a simpler, more efficient system.
Default memory managers must assume that deallocations happen approximately as often as allocations, and they might happen in any order and at any time. Their data structures have to keep track of a large number of blocks of available and used memory. Any time a piece of memory changes state from used to available the data structures must be quickly traversed. When blocks become available again, the memory manager must detect adjacent available blocks and merge them to make a larger block. Finding free memory of an appropriate size to minimize wasted space can be extremely tricky. Since default memory managers solve these problems to a large extent, their performance isn't as high as another memory manager that can make more assumptions about how and when memory allocations occur.
If your game can allocate and deallocate most of its dynamic memory space at once, you can write a memory manager based on a data structure no more complicated than a singly linked list. You'd never use something this simple in a more general case, of course, because a singly linked list has O(n) algorithmic complexity. That would cripple any memory management system used in the general case.
A good reason to extend a memory manager is to add some debugging features. Two features that are common include adding additional bytes before and after the allocation to track memory corruption, or to track memory leaks. The C runtime adds only one byte before and after an allocated block, which might be fine to catch those pesky x+1 and x-1 errors but doesn't help for much else. If the memory corruption seems pretty random, and most of them sure seem that way, you can increase your odds of catching the culprit by writing a custom manager that adds more bytes to the beginning and ending of each block. In practice the extra space is set to a small number, even one byte, in the release build.
Gotcha | Anything you do differently from the debug and release builds can change the behavior of bugs from one build target to another. Murphy's Law dictates that the bug will only appear in the build target that is hardest, or even impossible, to debug. |
Another common extension to memory managers is leak detection. MFC redefines the new operator to add __FILE__ and __LINE__ information to each allocated memory block in debug mode. When the memory manager is shut down all the unfreed blocks are printed out in the output window in the debugger. This should give you a good place to start when you need to track down a memory leak.
If you decide to write your own memory manager keep the following points in mind:
-
Data structures: Choose the data structure that matches your memory allocation scenario. If you traverse a large number of free and available blocks very frequently, choose a hash table or tree based structure. If you hardly ever traverse it to find free blocks you could get away with a list. Store the data structure separately from the memory pool; any corruption will keep your memory manager's data structure in tact.
-
Single/multithreaded access: Don't forget to add appropriate code to protect your memory manager from multithreaded access if you need it. Eliminate the protections if you are sure access to the memory manager will only happen from a single thread and you'll gain some performance.
-
Debug and testing: Allocate a little additional memory before and after the block to detect memory corruption. Add caller information to the debug memory blocks; at a minimum you should use __FILE__ and __LINE__ to track where the allocation occurred.
Small Memory Allocations
One of the best reasons to extend the C runtime memory manager is to write a better system to manage small memory blocks. The memory managers supplied in the C runtime or MFC library are not meant for tiny allocations. You can prove it to yourself by allocating two integers and subtracting their memory addresses as shown here:
int *a = new int; int *b = new int; int deltal = ((int)b - (int)a) - sizeof(int);
The wasted space for the C runtime library was 28 bytes for a release build and 60 bytes for the debug build under Visual Studio.NET. Even with the release build, an integer takes eight times as much memory space as it would if it weren't dynamically allocated.
Most games overload the new operator to allocate small blocks of memory from a reserved pool set aside for small allocations. Memory allocations that are larger than a set number of bytes can still use the C runtime. I recommend that you start with 128 bytes as the largest block your small allocator will handle, and tweak the size until you are happy with the performance.
Категории