
game benchmarking

PostPosted: Wed Feb 08, 2006 10:28 pm
by Jonathan
Reviewers should report CDFs of framerates, not average FPS or FPS over time. I'd also accept a PDF, but I think a CDF would be better because you could see at a glance whether a particular card runs HL2 below 30fps 10% of the time.
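As a rough illustration of reading such a number off the data, here is a minimal Python sketch. The sample values are made up for the example and are not measurements from any card or game; the function name is likewise just a placeholder.

```python
from bisect import bisect_left

def fraction_below(fps_samples, threshold):
    """Fraction of fps samples strictly below the threshold.

    This is one point on the empirical CDF of the samples.
    """
    ordered = sorted(fps_samples)
    return bisect_left(ordered, threshold) / len(ordered)

# Hypothetical fps samples, invented for illustration only.
samples = [62, 58, 45, 28, 71, 66, 59, 33, 60, 64]

print(fraction_below(samples, 30))  # share of samples under 30 fps
print(fraction_below(samples, 60))  # share of samples under 60 fps
```

Evaluating the same function at every threshold of interest gives the whole curve, so each reader can check their own cutoff.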

I've posted some tentative steps towards reporting minimums in addition to means, or FPS over time, but these are just baby steps on the way to a real CDF graph. That way every reader can pick for himself what constitutes an acceptable user experience. Reviewers can choose some limits and use those as the basis for their comparisons, so the non-technical crowd which just reads the conclusions anyway can get the same output.

PostPosted: Thu Feb 09, 2006 6:22 am
by quantus
How are the fps numbers calculated? Do they literally count frames rendered in a second, or is it done the real way, by measuring the time to render a frame and inverting the result?
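To make the difference between the two approaches concrete, here is a small Python sketch. It assumes a list of frame start timestamps in seconds is available; the function names and the sample timestamps are hypothetical and do not describe how any particular tool works.

```python
def fps_by_counting(timestamps, window=1.0):
    """Count the frames that start in each complete window of time."""
    if len(timestamps) < 2:
        return []
    start = timestamps[0]
    n_windows = int((timestamps[-1] - start) // window)
    counts = [0] * n_windows
    for t in timestamps:
        idx = int((t - start) // window)
        if idx < n_windows:  # ignore the trailing partial window
            counts[idx] += 1
    return [c / window for c in counts]

def fps_by_frame_time(timestamps):
    """Invert the duration of each individual frame."""
    return [1.0 / (b - a) for a, b in zip(timestamps, timestamps[1:])]

# Hypothetical timestamps: four quick frames, then two slow ones.
stamps = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
print(fps_by_counting(stamps))    # one value per one-second window
print(fps_by_frame_time(stamps))  # one value per frame
```

The counting version yields one number per window, so a single slow frame gets blended with its neighbours, while the inversion version keeps one number per frame.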

Anyways, I'd set the limit nearer to 99% at 30fps or better. Of course, I'd want to see where the tail goes... If it trails off to 1fps, that's unacceptable at any time. If it trails off to 27fps, I wouldn't mind an 80% limit.

PostPosted: Thu Feb 09, 2006 5:56 pm
by Jonathan
The beautiful thing about this methodology is that if you want 99% above 30fps and I only want 90%, then both data points are available to us. Additionally, the tail is also visible.

Your qualifications about what is acceptable demonstrate that you really need a CDF graph to determine what constitutes an acceptable user experience, because it depends on many factors.

FPS calculations depend on the benchmark. Any id Software game timedemo has a fixed number of frames that it renders as fast as possible, and it measures the total time required. For a CDF, you'd probably be using Fraps, because few games provide the level of detail necessary. I believe Fraps uses a sampling method: in each fixed window of time, it counts the number of frames rendered and reports that as the "instantaneous" fps. I should double check that, though.

PostPosted: Thu Feb 09, 2006 10:16 pm
by quantus
For open sourced engines, it should be possible to write in your own frame timer. Besides, I would think that game programmers worth their salt would already be including this timing functionality into their engines and looking for worst case performance. Maybe performance optimization is a lost art these days and ever faster hardware is just being chewed up by crappy programming.

Maybe nVidia or ATI could include a frame timer in their drivers? Still, I could see programmers gaming the system by moving as much of the data loading as possible outside of the actual drawing of a frame.

PostPosted: Thu Feb 09, 2006 11:06 pm
by Jonathan
Fraps is sufficient for Windows games. For Linux or OS X games, I'm not sure what's out there. AnandTech has an old article about reporting fps over time using some custom software, but I haven't seen them adopt that methodology in their reviews.

PostPosted: Thu Feb 09, 2006 11:27 pm
by Jonathan
I had written a post about how the sampled CDF, or empirical distribution function, is what I am really after, not the actual CDF. Nobody knows what you're talking about when you type EDF, though.

PostPosted: Fri Feb 10, 2006 2:05 am
by quantus
There isn't any real reason to sample though. The actual data is obtainable.

PostPosted: Sun Feb 12, 2006 6:36 pm
by Martin
I think how you'd want it is: for each frame, record the time it took to render it. From this, you can calculate the CDF by inverting the numbers (x := 1/x), then sorting them.
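A minimal Python sketch of that recipe, assuming per-frame render times in seconds have already been recorded. The function name and the sample times are invented for illustration.

```python
def fps_cdf_from_frame_times(frame_times):
    """Invert each render time to get fps, sort, and attach cumulative fractions.

    Returns (fps, fraction of frames at or below that fps) pairs.
    Every frame counts equally here, regardless of how long it took.
    """
    rates = sorted(1.0 / t for t in frame_times)
    n = len(rates)
    return [(rate, (i + 1) / n) for i, rate in enumerate(rates)]

# Hypothetical render times in seconds.
times = [0.5, 0.25, 0.125, 0.25]
for fps, height in fps_cdf_from_frame_times(times):
    print(fps, height)
```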

I guess ideally you would play the entire game this way to get an idea of the big picture. It wouldn't create too much data: at 60 fps, a 10 hour game comes to 2.16 million doubles, or about 17 megabytes.
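The arithmetic behind that estimate, written out as a quick Python check. It assumes a steady 60 fps and one 8-byte double stored per frame.

```python
FPS = 60               # assumed steady frame rate
HOURS = 10             # assumed length of the game
BYTES_PER_DOUBLE = 8   # one 64-bit float per frame

frames = FPS * HOURS * 3600
size_bytes = frames * BYTES_PER_DOUBLE

print(frames)            # number of recorded frame times
print(size_bytes / 1e6)  # size in megabytes (10**6 bytes)
```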

Here's one problem though: too much attention is drawn toward fast frames. The fact is, if 1/60 of your frames are at 1 FPS and 59/60 of your frames are at 60 FPS, more than half of the time you are in 1 FPS mode. You want your CDF to give you the probability that you are encountering a frame of that speed or faster at any given moment in time. To correct for this, weight each frame by its actual render time. In other words, at point n+1, record its height as the height at point n plus the time it took to render point n+1 divided by the total render time (to normalize).
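Here is a sketch of that time-weighted version in Python, using the 1/60 example from the post as input. One assumption to flag: the frames are sorted from slowest to fastest, so each height is the share of wall-clock time spent at that fps or slower; subtracting from 1 gives the "or faster" reading. The function name is a placeholder.

```python
def time_weighted_cdf(frame_times):
    """Return (fps, share of total time spent at or below that fps) pairs.

    Each frame adds its own render time to the running height,
    normalized by the total render time.
    """
    total = sum(frame_times)
    # Longest render time means lowest fps, so sort descending by time.
    ordered = sorted(frame_times, reverse=True)
    points = []
    elapsed = 0.0
    for t in ordered:
        elapsed += t
        points.append((1.0 / t, elapsed / total))
    return points

# One frame in sixty takes a full second; the rest run at 60 fps.
example = [1.0] + [1.0 / 60] * 59
points = time_weighted_cdf(example)

print(points[0])         # the single 1 fps frame and its share of the time
print(1 / len(example))  # its share when every frame counts equally
```

The first point comes out above 0.5, while the same frame accounts for only 1/60 of the frame count, which is the bias the post describes.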

PostPosted: Mon Feb 13, 2006 3:16 pm
by Jonathan
Measuring over the entire game is not repeatable. The code executed and the scenes rendered will vary from run to run, making it unfair as a benchmark. Nor is it reasonable to test many games with many different configurations by playing the whole game. That kind of testing will take too long.

A good benchmark can be run in a matter of minutes, not hours or seconds, and ideally executes the same code regardless of the hardware configuration. With shorter benchmarks, the run-to-run variation due to OS overhead, disk seek time, or DRAM page hits can be excessive. Longer benchmarks take too long to produce results.

When measuring throughput, one can't guarantee that exactly the same code will be executed on all systems, because systems with higher throughput will execute more code. However, for game benchmarks, if one keeps the scenes to be rendered identical between configurations, then the benchmark is relatively fair. Such a test matches actual game play rather well.

Unreal Tournament 2004 has a built-in benchmark which I think is a good example of a reasonable benchmark. The game constructs a match between AI bots. The seed is reused from run to run so the bots' actions are repeatable.
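The general idea of reusing a seed can be shown with a toy Python example. This is only an illustration of seeded pseudo-random choices; it says nothing about how UT2004 actually implements its bots, and the action names are invented.

```python
import random

ACTIONS = ["move", "jump", "shoot", "strafe"]  # made-up action names

def simulate_bot_actions(seed, steps=8):
    """Pick a sequence of pseudo-random actions from a seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(ACTIONS) for _ in range(steps)]

run_a = simulate_bot_actions(seed=2004)
run_b = simulate_bot_actions(seed=2004)
print(run_a == run_b)  # same seed, same sequence of actions
```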

Older games, such as Quake 3 Arena, rendered a fixed collection of frames as fast as possible. This style of benchmark is very fair, because the workload does not vary from machine to machine. However, the utility of such a benchmark for evaluating CPU performance is poor, because none of the code the CPU executes during the actual game is run during the benchmark.

Fixed interval sampling also gets rid of bias. I believe fixed interval sampling is the best plan for two reasons. First, benchmarkers can use fixed interval sampling to produce EDF graphs using technology which is widely available now. Second, when measuring performance one should use as little instrumentation as possible to avoid interfering with the outcome.

PostPosted: Wed Feb 15, 2006 7:08 am
by quantus
Agreed, there has got to be a limit to how realistic and long the demo is, and Jonathan's right, it has to be repeatable. However, I don't like EDFs or sampling. They'll average out the 1s frame. I'd imagine the render times fitting a Weibull Distribution preferably plotted like this. This sort of plot emphasizes the tail of the distribution, which is what people really care about. People only want higher frame rates because usually the tail will not drift down into an intolerable range if frames can mostly be generated at 60+fps. As Martin alluded to, 120fps most of the time means Jack if every 120th frame takes 1s and you average out to 60 fps. I'm sure that writing out a timestamp at the beginning of every frame won't slow down the system too much, especially if the write is buffered to a block at a time. Besides, if every game has the same overhead, it's still comparable.
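A rough Python sketch of what such a buffered timestamp logger could look like. The class name, the block size, and the on-disk layout (little-endian 8-byte doubles) are all assumptions made for this example, not something taken from an existing engine or driver.

```python
import io
import struct
import time

class FrameTimer:
    """Hypothetical logger: one timestamp per frame, written a block at a time."""

    def __init__(self, stream, clock=time.perf_counter, block_frames=4096):
        self._stream = stream
        self._clock = clock
        self._block_frames = block_frames
        self._pending = []

    def mark_frame(self):
        """Call at the start of every frame; only touches memory most of the time."""
        self._pending.append(self._clock())
        if len(self._pending) >= self._block_frames:
            self.flush()

    def flush(self):
        """Write all buffered timestamps as little-endian doubles."""
        if self._pending:
            packed = struct.pack(f"<{len(self._pending)}d", *self._pending)
            self._stream.write(packed)
            self._pending.clear()

def read_timestamps(data):
    """Decode the bytes written by FrameTimer back into a list of floats."""
    return list(struct.unpack(f"<{len(data) // 8}d", data))

# Usage with an in-memory stream standing in for a log file.
log = io.BytesIO()
timer = FrameTimer(log, block_frames=2)
for _ in range(3):
    timer.mark_frame()
timer.flush()
print(len(read_timestamps(log.getvalue())))
```

Per-frame render times are then the differences between consecutive timestamps, so the slow frames stay visible instead of being averaged away.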

PostPosted: Thu Feb 16, 2006 5:18 am
by Jonathan
quantus wrote: I'd imagine the render times fitting a Weibull Distribution preferably plotted like this. This sort of plot emphasizes the tail of the distribution which is what people really care about.

The plot you linked to shows how closely a given set of data matches a theoretical distribution. Is that what you meant to link? The correlation to some probability distribution wouldn't emphasize the tail of the distribution. The data could be a very weird, non-smooth distribution that doesn't match a Weibull at all.

PostPosted: Fri Feb 17, 2006 7:59 am
by quantus
I meant this link sorta... It's a bit hard to see the labels on each axis, but it should be legible enough to get the point across. I had a better picture before, but apparently I didn't copy it correctly.

PostPosted: Tue Feb 13, 2007 5:49 am
by Jonathan
I contacted one of the hardware sites I trust and explained my idea to them. They were very receptive, but wanted me to write the article myself. I need to put together some data, but sadly my lack of a Windows partition is stopping me from plopping down FRAPS and running some UT2004 botmatches to collect data.