GDC 2009 presentation by Jonathan Greenberg, Graphics Lead on Mortal Kombat vs DC Universe about how the team was able to get Unreal Engine 3 to run at 60Hz for their game.
Hitting 60Hz with the Unreal Engine: Inside the Tech of Mortal Kombat vs DC Universe. Jon Greenberg (MK Team), Nathan Mefford (Chicago ATG).
Why Bother? In general, twitch games require a very high framerate. Fast input response demands fast feedback to the player. Running at 60Hz is a basic requirement of the fighting genre.
Why Is 60 So Rare? Very few games target 60Hz (< 10% of games). There are only 16.7 ms in which to do everything, vs 33.3 ms at 30Hz. That implies half the time to do everything, but this is not correct. In general, you have ~1/3 the time, due to fixed-cost overhead which can't be removed. The customer doesn't care that you have less time to do everything; they still want the game to look great. The game must hit 60Hz on both PS3 and Xbox 360, and both versions must look as close as possible!
Why 1/3rd The Time?
The game must run at >= 60Hz; it is not allowed to drop frames (bog). This means we have to set aside headroom that can absorb instantaneous spikes. MK vs DC steady state ~= 9.5 ms per frame. This allows for a lot of particle effects and variability. Other genres (even other fighting games) likely need a great deal less slack. Philosophy: always address worst-case scenarios up front.
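The budget arithmetic behind these numbers can be sketched as follows. This is only an illustration; the function names are hypothetical and the only inputs taken from the slides are the 60Hz/30Hz targets and the ~9.5 ms steady state.

```cpp
#include <cassert>

// Milliseconds available per frame at a given refresh rate.
constexpr double FrameBudgetMs(double hz) { return 1000.0 / hz; }

// Slack left to absorb spikes once the steady-state frame cost is paid.
constexpr double HeadroomMs(double hz, double steadyStateMs) {
    return FrameBudgetMs(hz) - steadyStateMs;
}

// Steady-state cost expressed as a fraction of a full 30Hz frame.
inline double ShareOf30HzFrame(double steadyStateMs) {
    assert(steadyStateMs >= 0.0);
    return steadyStateMs / FrameBudgetMs(30.0);
}
```

With a 9.5 ms steady state, roughly 7.2 ms of each 16.7 ms frame is left as headroom, and 9.5 ms is about 28.5% of a 33.3 ms frame, which is where the "~1/3 the time" figure comes from.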
The Problem (part 1): Midway had decided to use Unreal Engine 3 (UE3) as basic middleware across all internal games. Using UE3 was required by mgmt. UE3 was (is) designed for 30Hz FPS/3rd-person action titles. We started with the October 2006 (post Gears of War 1) codebase. Some additional features were taken from Epic à la carte, e.g. MITV, file caching, misc fixes. We had about 22 months to develop the game.
The Problem (part 2): UE3 brings a lot to the table (nice tools, wide feature set) but imposes a lot of heavy fixed costs. Some choices made in the engine also have problematic side effects for 60Hz play (UObject overhead, garbage collection, etc.). The out-of-the-box fixed-cost baseline (especially GPU) is too high for a 60Hz title. E.g., the Oct06 build's GPU baseline is ~9 ms.
Breaking it Down. GPU overhead: GPU fixed costs, general rendering overhead, multipass overhead, lighting cost, particle cost. CPU overhead: particle cost, cloth & water, render thread virtual overhead/state caching.
GPU Fixed Costs: Post-processing is usually the biggest fixed cost. Combine as many operations together as possible to hide work (i.e., Bloom+DOF+Gamma+Resolution retarget). Cut as many corners as possible and special-case as necessary; e.g., we use 1 of 3 different DOF methods depending on the case. Normal gameplay: classic blur cross-fade. Main Menu/Cinematics: dilating Poisson disc. Klose-Kombat: a series of blur planes.
Normal DOF+Bloom effect cost = 1.8 ms
Bloom: Bloom is done a little strangely, to compensate for the linear color range and for not having a separate downsample/blur:
We had separate thresholding and strength values for characters and the general background to allow the two to be tuned differently. Character masks were written/read from stencil buffer.
Per environment thresholding value determines which pixels bloom. Thresholding is done inside downsample pass and written out into the alpha channel as 0 or 1. This bloom mask is then blurred along with color.
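The thresholding-inside-downsample idea can be sketched on the CPU as below. This is a minimal illustration, not the shipped shader: the 2x2 box filter, the Rec. 601 brightness metric, and all names (`Rgba`, `BloomThresholds`, `DownsampleWithBloomMask`) are assumptions, and `isCharacter` stands in for the stencil test.

```cpp
#include <array>
#include <cassert>

struct Rgba { float r, g, b, a; };

// Per-environment tuning, with separate values for characters and background.
struct BloomThresholds { float character; float background; };

// Assumed brightness metric (Rec. 601 luma weights); the slides do not name one.
inline float Brightness(const Rgba& c) {
    return 0.299f * c.r + 0.587f * c.g + 0.114f * c.b;
}

// Box-filters a 2x2 quad down to one texel and writes a 0/1 bloom mask
// into alpha, so the later blur pass blurs the mask along with the color.
inline Rgba DownsampleWithBloomMask(const std::array<Rgba, 4>& quad,
                                    bool isCharacter,
                                    const BloomThresholds& t) {
    assert(t.character >= 0.0f && t.background >= 0.0f);
    Rgba out{0.0f, 0.0f, 0.0f, 0.0f};
    for (const Rgba& p : quad) {
        out.r += 0.25f * p.r;
        out.g += 0.25f * p.g;
        out.b += 0.25f * p.b;
    }
    const float threshold = isCharacter ? t.character : t.background;
    out.a = Brightness(out) > threshold ? 1.0f : 0.0f;
    return out;
}
```

Because the mask lives in the alpha channel of the downsampled target, no separate bloom-extraction pass is needed.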
Distortion: The normal UE3 distortion effect has 3 ms of overhead! Instead, fold distortion into translucency. Sample from a snapshot of the opaque pass, and do a depth-based selection to prevent near distortion. The overhead is now just capturing the snapshot, a copy blit of the color buffer (~0.4 ms). Now usable everywhere! Optionally support recapture of the snapshot per distorting effect to allow for layered distortion effects as well; this was needed for the water level.
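One way to read the depth-based selection is sketched below: if the pixel fetched at the distorted coordinate is nearer to the camera than the distorting surface, fall back to the undistorted pixel so foreground objects do not smear into the effect. This interpretation, the single-channel snapshot, and the names `OpaqueSnapshot` and `SampleDistorted` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Snapshot of the opaque pass: one color value and one depth per pixel.
struct OpaqueSnapshot {
    int width;
    int height;
    std::vector<float> color;  // single channel for brevity
    std::vector<float> depth;  // assumed convention: smaller = nearer
};

// Returns the snapshot color a distorting surface should show at (x, y).
inline float SampleDistorted(const OpaqueSnapshot& s, int x, int y,
                             int offsetX, int offsetY, float surfaceDepth) {
    assert(s.width > 0 && s.height > 0);
    assert(x >= 0 && x < s.width && y >= 0 && y < s.height);
    const int sx = std::min(std::max(x + offsetX, 0), s.width - 1);
    const int sy = std::min(std::max(y + offsetY, 0), s.height - 1);
    const std::size_t shifted = static_cast<std::size_t>(sy * s.width + sx);
    const std::size_t original = static_cast<std::size_t>(y * s.width + x);
    // Depth-based selection: reject samples that lie in front of the surface.
    const bool inFront = s.depth[shifted] < surfaceDepth;
    return s.color[inFront ? original : shifted];
}
```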
Motion Blur: Very expensive to do full-screen. Epic doesn't support motion blurring of skinned geometry! Instead, motion blur effects are done by rendering velocity-stretched, fading geometry. This required changing GPU skinning (PC/360) and Edge (PS3 SPU) to support skinning against previous bone positions. It also requires a localized blur-only Z-prepass to prevent additive blur effects from blending badly.
Shadows and MSAA: The game made use of MSAA-2x on both platforms. Resolving MSAA is very expensive on PS3. Combine the full-screen modulated shadow blit with the MSAA color/depth resolve! Hide heavy texture-bandwidth operations inside math-heavy shadow work. Shadow ALU overhead is high enough that we can also hide the distortion copy blit! No self-shadowing; it is disabled via stencil mask. Since there's no self-shadowing anyway, we use proxy shadow characters. Total cost ~= 1.33 ms.
Fog: Full-screen per-pixel fog costs ~2 ms on the GPU. Visible vertices < visible pixels! Per-pixel fog is often overkill. Replaced with per-vertex fog and per-object fog (characters). To keep per-vertex costs low, only 2 active fog actors are supported. Height fog is optional and controlled via static branching. Also added optional undulating height fog, by pulsing sine waves through the fog height. Dramatically cheaper!
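A per-vertex fog factor with an optional, undulating height term might look like the sketch below. The linear distance model, the way the sine wave offsets the fog top, and every name here (`FogActor`, `VertexFogFactor`) are assumptions; the template parameter plays the role of the static branch, so the height path compiles away when it is off.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct FogActor {
    float start, end;      // linear distance fog range (assumed model)
    float fogTop;          // world height of the top of the fog layer
    float heightFalloff;   // density gained per unit below the fog top
    float waveAmplitude;   // height of the undulation
    float waveFrequency;   // radians per world unit along x
    float waveSpeed;       // radians per second
};

inline float Saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Evaluated once per vertex; kHeightFog mirrors a static shader branch.
template <bool kHeightFog>
float VertexFogFactor(const FogActor& f, float distance,
                      float worldX, float worldY, float timeSeconds) {
    assert(f.end > f.start);
    float fog = Saturate((distance - f.start) / (f.end - f.start));
    if (kHeightFog) {
        const float top = f.fogTop + f.waveAmplitude *
            std::sin(f.waveFrequency * worldX + f.waveSpeed * timeSeconds);
        fog *= Saturate((top - worldY) * f.heightFalloff);
    }
    return fog;
}
```

With at most two fog actors, a vertex shader would evaluate this twice per vertex and interpolate the result, rather than once per pixel.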
General Rendering: 8 bpc render targets, linear color scale of 0..2. We light in a combination of γ=1.0 and γ=2.2, depending on what we're lighting, to save cost. Opaque: uses MSAA. Translucent: post-MSAA resolve. Heavy use of the PlayStation Edge library for skinned and world geometry on PS3. The 3D resolution of the game was 1040x624, which was then scaled up to allow the HUD to render at 1280x720.
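To make the 0..2 scale and the resolution trade-off concrete, here is a small sketch. The encoding shown (a plain divide-by-two into 8 bits) is an assumption about how a 0..2 linear range could be stored; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Stores a linear value in [0, 2] in one 8-bit channel (assumed mapping).
inline std::uint8_t EncodeLinear0To2(float value) {
    assert(!std::isnan(value));
    const float clamped = std::min(std::max(value, 0.0f), 2.0f);
    return static_cast<std::uint8_t>(std::lround(clamped * 0.5f * 255.0f));
}

// Recovers the linear value from the stored channel.
inline float DecodeLinear0To2(std::uint8_t stored) {
    return static_cast<float>(stored) / 255.0f * 2.0f;
}

// Number of pixels shaded for a given render resolution.
constexpr long PixelCount(long width, long height) { return width * height; }
```

Rendering 3D at 1040x624 shades 648,960 pixels instead of the 921,600 of a full 1280x720 target, about 70% of the fill cost, while the HUD still renders at native 720p.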
Multipass Overhead: Pass-per-light overhead is simply too high. We're mostly prelit, so we chose forward rendering. Z-prepass? Typical depth complexity is < 1.5. Loosely sort opaque objects front to back via rings of detail. Removing the Z-prepass saves ~0.75 ms. Touch each pixel only once if possible.
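A loose front-to-back ordering can be had without a full sort by bucketing objects into distance bands. The sketch below assumes "rings" means concentric distance bands around the camera; that reading, and the name `LooseFrontToBack`, are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns object indices ordered nearest ring first. Objects inside a ring
// keep their submission order, so the sort is only approximate by design.
inline std::vector<std::size_t> LooseFrontToBack(
        const std::vector<float>& cameraDistance,
        float ringWidth, std::size_t ringCount) {
    assert(ringWidth > 0.0f && ringCount > 0);
    std::vector<std::vector<std::size_t>> rings(ringCount);
    for (std::size_t i = 0; i < cameraDistance.size(); ++i) {
        const float d = std::max(cameraDistance[i], 0.0f);
        // Clamp in float first so far objects land in the last ring.
        const float ringF = std::min(d / ringWidth,
                                     static_cast<float>(ringCount - 1));
        rings[static_cast<std::size_t>(ringF)].push_back(i);
    }
    std::vector<std::size_t> order;
    order.reserve(cameraDistance.size());
    for (const auto& ring : rings) {
        order.insert(order.end(), ring.begin(), ring.end());
    }
    return order;
}
```

With depth complexity under 1.5, this coarse ordering lets early-Z reject most hidden pixels without paying for a Z-prepass.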
World Lighting (static): The world is prelit using Illuminate Labs' Beast, with some dynamic RNMs built with Turtle. Dynamic RNMs are animated in materials or via MITVs. Prelit lighting was a mix of texture and vertex RNM lighting, with a fast path added to support per-vertex, diffuse-only RNM evaluation for distant objects.
World Lighting (dynamic): Effect point lighting is done via a mix of per-pixel lighting (floors) and per-vertex lighting (the rest of the environment). To account for maximum load, shaders are built with three diffuse-only point lights active and burned into the material. No branching! All three lights are always evaluated. These lights are globally assigned and managed in a 3-deep FIFO.
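The fixed three-slot scheme can be sketched as a ring buffer whose slots are always evaluated, with unused slots simply contributing zero. The linear distance falloff and all names (`PointLight`, `EffectLightFifo`) are assumptions for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kEffectLightSlots = 3;

struct PointLight { float x, y, z; float radius; float intensity; };

class EffectLightFifo {
public:
    EffectLightFifo() { slots_.fill(PointLight{0.0f, 0.0f, 0.0f, 1.0f, 0.0f}); }

    // A new light always replaces the oldest one.
    void Push(const PointLight& light) {
        assert(light.radius > 0.0f);
        slots_[next_] = light;
        next_ = (next_ + 1) % kEffectLightSlots;
    }

    // Diffuse-only sum over all three slots; no per-light branching.
    float Diffuse(float px, float py, float pz,
                  float nx, float ny, float nz) const {
        float sum = 0.0f;
        for (const PointLight& l : slots_) {
            const float lx = l.x - px, ly = l.y - py, lz = l.z - pz;
            const float dist =
                std::max(std::sqrt(lx * lx + ly * ly + lz * lz), 1e-6f);
            const float nDotL =
                std::max((nx * lx + ny * ly + nz * lz) / dist, 0.0f);
            const float falloff = std::max(1.0f - dist / l.radius, 0.0f);
            sum += l.intensity * nDotL * falloff;  // assumed linear falloff
        }
        return sum;
    }

private:
    std::array<PointLight, kEffectLightSlots> slots_;
    std::size_t next_ = 0;
};
```

Paying for three lights on every draw keeps the worst case and the typical case identical, which is the point of budgeting for maximum load.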
Character Lighting (part 1): Custom lighting model. Irradiance volume of SH coefficient sets. Evaluate gradients to determine an SH set per object. Diffuse-light the model using only the first 4 coefficients (ambient and directional terms). The 3 effect point lights are evaluated per-vertex and combined into the final diffuse lighting result. Spec is faked via power-scaling of (E·N) and multiplying by the diffuse lighting.
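Evaluating only the first four SH coefficients reduces to a constant term plus a dot product with the normal. The sketch below uses one common real-SH sign and ordering convention and assumes the coefficients are already stored as irradiance; both are assumptions, as is the name `EvalSh4`.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr float kShBand0 = 0.282095f;  // constant (ambient) basis value
constexpr float kShBand1 = 0.488603f;  // scale of the three linear bases

// Coefficient order assumed: {L00, L1-1 (y), L10 (z), L11 (x)}.
// One color channel shown; a real shader would evaluate RGB.
inline float EvalSh4(const std::array<float, 4>& c,
                     float nx, float ny, float nz) {
    assert(std::fabs(nx * nx + ny * ny + nz * nz - 1.0f) < 1e-3f);
    return kShBand0 * c[0] + kShBand1 * (c[1] * ny + c[2] * nz + c[3] * nx);
}
```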
Character Lighting (part 2): Skin transmission is faked by using (E·N) as the lerp factor between diffuse lighting and the SH ambient term. Rim lighting: power-scale (1-E·N) for falloff, then multiply by a hard threshold of (1-E·N). If the threshold is raised high enough (~0.7), it ends up looking like chrome mapping! Final rendering cost ~= 0.8 ms per character. Character mesh chunks are batch rendered.
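The three fakes described on the character lighting slides can be written as scalar functions of E·N. This is a sketch under assumptions: the lerp direction (diffuse at E·N = 0, ambient at E·N = 1) is a literal reading of the slide, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

inline float Saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Fake specular: power-scaled E.N, modulated by the diffuse lighting.
inline float FakeSpecular(float eDotN, float power, float diffuse) {
    assert(power > 0.0f);
    return std::pow(Saturate(eDotN), power) * diffuse;
}

// Fake skin transmission: E.N blends between diffuse and the SH ambient term.
inline float SkinTransmission(float eDotN, float diffuse, float shAmbient) {
    const float t = Saturate(eDotN);
    return diffuse + (shAmbient - diffuse) * t;
}

// Rim light: power-scaled (1 - E.N) falloff, gated by a hard threshold.
inline float RimLight(float eDotN, float power, float threshold) {
    assert(power > 0.0f);
    const float rim = 1.0f - Saturate(eDotN);
    const float mask = rim >= threshold ? 1.0f : 0.0f;
    return std::pow(rim, power) * mask;
}
```

Raising `threshold` toward ~0.7 confines the rim to a narrow, hard-edged band at grazing angles, which is what produces the chrome-like look mentioned above.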
Skin and Metal
The Story So Far: So far costs are:
Misc: ~0.5 ms
Shadowmaps: 0.5 ms
Characters: 1.6 ms
Environment: ~4.X ms
MSAA Resolve/Shadow: 1.3 ms
PostFX: 1.8 ms
Total: ~9.X ms
What about particle effects?
Particle Effects: Very large problem. Cascade is not very optimal. Solution: port the Cascade runtime to run async on separate worker threads (on SPU on PS3)! All emitters for a particle system are updated in a single block of async work (particles, emitter state, system state). All particle modules were ported to SPU, except for collision (due to data complexity).
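The "single block of work" idea can be sketched as a job struct that owns everything one particle system needs, plus a function that touches nothing outside it, so it could be handed to any worker. This is not Cascade's actual data layout; the types, the two example modules, and `RunParticleSystemJob` are hypothetical, and the dispatch to a worker thread or SPU is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { float position[3]; float velocity[3]; float age; float lifetime; };

// Plain polymorphic modules, in the spirit of "basic C++ code".
struct ParticleModule {
    virtual ~ParticleModule() = default;
    virtual void Update(Particle& p, float dt) const = 0;
};

struct GravityModule : ParticleModule {
    float gravity;
    explicit GravityModule(float g) : gravity(g) {}
    void Update(Particle& p, float dt) const override { p.velocity[1] -= gravity * dt; }
};

struct IntegrateModule : ParticleModule {
    void Update(Particle& p, float dt) const override {
        for (int i = 0; i < 3; ++i) p.position[i] += p.velocity[i] * dt;
        p.age += dt;
    }
};

struct Emitter {
    std::vector<Particle> particles;
    std::vector<const ParticleModule*> modules;
};

// Everything one system needs for a frame: emitters, timestep, system state.
struct ParticleSystemJob {
    std::vector<Emitter> emitters;
    float dt;
    std::size_t liveParticles;
};

// Updates every emitter of one system in a single self-contained pass.
inline void RunParticleSystemJob(ParticleSystemJob& job) {
    assert(job.dt >= 0.0f);
    job.liveParticles = 0;
    for (Emitter& e : job.emitters) {
        for (Particle& p : e.particles)
            for (const ParticleModule* m : e.modules) m->Update(p, job.dt);
        e.particles.erase(
            std::remove_if(e.particles.begin(), e.particles.end(),
                           [](const Particle& p) { return p.age >= p.lifetime; }),
            e.particles.end());
        job.liveParticles += e.particles.size();
    }
}
```

Because the job reads and writes only its own data, the game thread's cost is limited to building and collecting jobs, not iterating particles.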
Particle Effects (CPU load): All per-particle overhead removed from the Game/Render threads! Particle overhead is now a simple linear relationship between system count and emitter count. On PC/360, vertex data for sprites is created JIT by an async worker thread. No changes/compromises to artist tools or workflow.
Particle Effects (SPU load): SPUs are extremely fast. We just used basic C++ code (including templates and polymorphism). No need to bother with intrinsics or ASM. The same module code runs on PS3/360. Complex (dependent) DMAs are done synchronously; this is simpler to deal with and fast enough that it doesn't matter. The update is done via a SPURS job.
Particle Effects (GPU load): GPU overhead is less straightforward. Attempt 1: lie to the hardware and tell it we're in MSAA-4x on a non-MSAA target. Looks okay on wispy stuff in general (smoke, fire, etc.), but looks terrible on 360.
Particle Effects (GPU cont): Attempt 2: for somewhat opaque particles, break the effect out into a masked pass and an unmasked pass, sorting the particles of a system front to back before rendering to prime Z. 1. Render particles with alpha-test set to =1.0, front to back. 2. Render particles with alpha-test set to 50%). Requires the artist to identify the channel to scan for image bounds.
General Render Thread Optimizations: Lots of work to reduce unnecessary operations. Render thread virtuals = death by a thousand paper cuts. Cache as much state as possible to reduce redundant virtual calls. E.g., replaced FMaterialRenderProxy's GetMaterial virtual call with a caching call. Removed tons of unneeded repeated calls to GetXXX() (i.e., GetPixelShader) state getters from inside shader processing.
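The caching pattern can be sketched generically as below. These classes are hypothetical stand-ins, not UE3's FMaterialRenderProxy API: `MaterialSource` represents any object with an expensive virtual getter, and `CachedMaterial` resolves it once and reuses the pointer until told otherwise.

```cpp
#include <cassert>

struct Material { int id; };

// Stand-in for a proxy whose getter is a virtual call.
class MaterialSource {
public:
    virtual ~MaterialSource() = default;
    virtual const Material* ResolveMaterial() const = 0;
};

// Test double that counts how often the virtual is actually hit.
class CountingSource : public MaterialSource {
public:
    explicit CountingSource(int id) : material_{id} {}
    const Material* ResolveMaterial() const override {
        ++calls_;
        return &material_;
    }
    int Calls() const { return calls_; }
private:
    Material material_;
    mutable int calls_ = 0;
};

// Resolves through the virtual once, then serves the cached pointer.
class CachedMaterial {
public:
    explicit CachedMaterial(const MaterialSource& source) : source_(&source) {}
    const Material* Get() const {
        if (cached_ == nullptr) cached_ = source_->ResolveMaterial();
        assert(cached_ != nullptr);
        return cached_;
    }
    // Must be called whenever the underlying material can change.
    void Invalidate() { cached_ = nullptr; }
private:
    const MaterialSource* source_;
    mutable const Material* cached_ = nullptr;
};
```

The trade-off is the usual one for caches: every code path that can change the underlying state has to invalidate, otherwise the render thread draws with stale data.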
Misc Further Optimizations: Cloth simulation moved to run async in another thread (SPU on PS3). Epic's water simulation code ported to run on SPU on PS3. Animation is still synchronous and game-thread based, but doesn't use AnimTrees; very limited blend options for designers. No occlusion pass; vis is simple frustum culling. Lots of work to reduce the amount of memory allocation via pools and isolated heaps. Still, it accounts for 25% of CPU time.
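Simple frustum culling is commonly done by testing a bounding sphere against the six frustum planes, as sketched here. The slides do not say which bounding volume the game used, so the sphere test and the names `Plane`, `Sphere`, and `IsInFrustum` are assumptions.

```cpp
#include <array>
#include <cassert>

// Plane in the form n.p + d = 0, with the normal pointing into the frustum.
struct Plane { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

// Conservative test: true if the sphere is inside or straddles the frustum.
inline bool IsInFrustum(const std::array<Plane, 6>& frustum, const Sphere& s) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        const float signedDistance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (signedDistance < -s.radius) return false;  // fully outside
    }
    return true;
}
```

In a fighting game arena with a mostly fixed camera and low depth complexity, this cheap test can be enough on its own, which fits the decision to skip an occlusion pass.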
Garbage Collection: Based on work by the Stranglehold team. Not quite as aggressive as they were, but removes all live calling of GC from gameplay; it is only called when exiting modes. Memory management switched to deferred (by a frame) cleanup of UObjects/AActors. All loaded data is trapped via the Rootset. Introduces a UResource class, a reference-counting UObject. All USurface-derived classes (i.e., UMaterial, UTexture, etc.) are reference counted via UResource to prevent unwanted deletion.
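Reference counting combined with one-frame deferred deletion can be sketched as follows. This is not the UResource implementation; `CountedResource` and `DeferredCleanup` are hypothetical names, and the destruction counter exists only so the behavior can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class CountedResource {
public:
    explicit CountedResource(int* destroyedCounter)
        : destroyedCounter_(destroyedCounter) {}
    ~CountedResource() { if (destroyedCounter_) ++*destroyedCounter_; }
    void AddRef() { ++refCount_; }
    // Returns the remaining count; 0 means the resource may be retired.
    int Release() { assert(refCount_ > 0); return --refCount_; }
private:
    int refCount_ = 0;
    int* destroyedCounter_;
};

// Holds retired resources for one extra frame before destroying them,
// so work still in flight for the current frame never sees freed memory.
class DeferredCleanup {
public:
    void Retire(std::unique_ptr<CountedResource> r) {
        retiredThisFrame_.push_back(std::move(r));
    }
    void EndFrame() {
        retiredLastFrame_.clear();                  // destroy last frame's batch
        retiredLastFrame_.swap(retiredThisFrame_);  // age this frame's batch
    }
    std::size_t Pending() const {
        return retiredThisFrame_.size() + retiredLastFrame_.size();
    }
private:
    std::vector<std::unique_ptr<CountedResource>> retiredThisFrame_;
    std::vector<std::unique_ptr<CountedResource>> retiredLastFrame_;
};
```

A resource retired during frame N survives the end of frame N and is destroyed at the end of frame N+1, which keeps deletion predictable and off the gameplay hot path.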
Additional Game Details: We don't use UnrealScript. Minimally use Kismet. We use our own scripting engine (C/C++-ish) for AI, object management, menu logic, etc. Game scripts are expected to manage resource lifetimes. Main advantage: dynamically reloadable for fast iteration! MKScripts describe resource usage to determine the cooked resources that need to be added to characters/backgrounds.
Artist Limitations: UE3 gives artists a lot of rope to hang themselves with. The big thing was to limit who could use the Material Editor. All character art uses the same small set of materials. Characters are budgeted at 20k polys visible at a time. Backgrounds are budgeted based on visible object count and storage limitations more than polycount. Environment material/lighting complexity was managed by the background lead to ensure the game hit GPU performance targets, with various metrics helping to tell them where they were.
General Recommendations for hitting 60Hz in UE3: Budget performance up front! Given Edge and the 360's unified shaders, geometry is less of a problem than fillrate. Predetermine valid PostFX and hardwire the majority of permutations. Reduce dynamic, critical-sectioned memory allocation as much as possible; it massively stalls all performance. Use pool allocators whenever possible, and watch for reallocs. Force designers and artists to run with performance metrics on!
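A fixed-size pool allocator of the kind recommended here can be sketched in a few lines. `FixedPool` is a hypothetical, single-threaded illustration: it never touches the general heap after construction, so there is no lock to contend on and no realloc to watch for.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

template <typename T, std::size_t N>
class FixedPool {
    static_assert(N > 0, "pool must hold at least one object");
public:
    FixedPool() {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < N; ++i) {
            slots_[i].next = free_;
            free_ = &slots_[i];
        }
    }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Constructs a T in a free slot, or returns nullptr when exhausted.
    template <typename... Args>
    T* Create(Args&&... args) {
        if (free_ == nullptr) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        ++used_;
        return new (s->storage) T(std::forward<Args>(args)...);
    }

    // Destroys the object and returns its slot to the free list.
    void Destroy(T* p) {
        assert(p != nullptr && used_ > 0);
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
        --used_;
    }

    std::size_t Used() const { return used_; }

private:
    union Slot {
        Slot* next;                                  // valid while free
        alignas(T) unsigned char storage[sizeof(T)]; // valid while in use
    };
    Slot slots_[N];
    Slot* free_ = nullptr;
    std::size_t used_ = 0;
};
```

Returning `nullptr` on exhaustion forces the caller to budget the pool size up front, which matches the "budget performance up front" advice.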
Recommendations for hitting 60Hz in UE3 on PS3 (well, and 360): Consider what can be deferred and/or made to run async, and consider moving that work. Consider using Edge on PS3. Even sync'd work can be done way faster on SPU if divided over multiple SPUs/threads! Don't be intimidated by the SPUs on PS3. Prototype SPU code on 360/PC, where it's easier to debug. Template-heavy C++ might not be the ideal performance case for SPUs, but it is certainly a LOT better than not using them at all.
Things We Have Yet to Address: Serialization: as we tend to only stream content underneath movie playback or load screens, the CPU impact wasn't too problematic for us, though it does impact load times. Animation: need to explore making it run on worker threads/SPU for deferrable (background and LOD'd) objects.
Questions? Thanks for listening!