View Full Version : Is Double Precision Really Needed?

Ed M.
08-13-2003, 02:55 PM
OK, guys, let's revisit this issue of double-precision within what are generally considered "high-quality" 3D renderers.

Lately I've been hearing from multiple sources that double-precision isn't really needed to produce extremely high-quality 3D output. I posted the excerpt that appears below on the NewTek forums a while ago and thought that it would be a good time to revisit it and perhaps finally lay some myths to rest.

I included the entire discussion because I wanted to *also* revisit the idea that a lot of people have about adding double-precision capability to AltiVec. Adding double-precision to AltiVec wouldn't be a good idea. In the end the addition of more FPUs is really the better way to go.

What it boils down to is that a large part of the reason the G4 performed so poorly in 3D apps is poor code, along with code that didn't fully exploit the features of the particular hardware (processor and system) or the advantages it offered. Of course, Motorola didn't help much by providing only what is best described as a single, anorexic bus. These issues have been eliminated with the G5.

The G5, among its other significant enhancements, adds an additional FPU and also a dual bus with MONSTER BANDWIDTH on *each* to help move things along. Coupled with better hardware and OS, the G5 screams through all that poorly optimized G4 code with ease. Imagine if developers decide to take advantage of all the features the G5 offers. They didn't do it with the G4. Sad because there could have been so much more realized performance. This is how customers were rewarded after shelling out loads of $$$ expecting top workmanship?? Anyway, that's a whole other topic that's being discussed... On with the excerpt...


Ed M.
08-13-2003, 02:56 PM
Here is what I posted once before:

As usual, I snooped around and found some interesting tidbits that many people fail to notice, then checked them for accuracy and validity by asking some legitimate sources... Many users and marketing-types absolutely swear by the "quality" of renders that a double-precision calc would produce. I notice that these claims fail to mention any threshold with respect to human limitations of sight and vision. There is a point where the human eye, no matter how good your vision, will not be able to discern/resolve any increase in resolution/quality even if it were there. And since we are talking about full-motion animated 3D scenes shot on current monitors and TVs, many tricks can be played out on human vision; even the best human vision.

From what I've discovered, it's reasonable to believe that you don't *need* double precision for 3D, unless you are really, really sloppy with your algorithms (and code) and all that it's useful for is covering up the errors that are produced. In short, double precision calcs are usually employed because you can get away with a lot more slop in your coding. Taken from a discussion with a Ph.D. and cross-platform developer.

Q: Is an updated double precision-centric AltiVec unit the way to go?

A: No.

This is why:

The vector registers have room for four single precision floats to fit in each one. So for single precision, you can do four calculations at a time with a single AltiVec instruction. AltiVec is fast because you can do multiple things in parallel this way.

Most AltiVec single precision floating point code is 3-4 times faster than the usual scalar single precision floating point code for this reason. The reason that it is more often only three times faster and not the full four times faster (as would be predicted by the parallelism in the vector register I just mentioned) is that there is some additional overhead for making sure that the floats are in the right place in a vector register, that you don't have to deal with in the scalar registers. (There is only one way to put a floating point value in a scalar register.)

Double precision floating point values are twice as big (take up twice as many bytes) as single precision floating point values. That means you can only cram two of them into the vector register instead of four. If our experience with single precision floating point translates to double precision floating point, then the best you could hope to get by having double precision in AltiVec is a (3 to 4)/2 = 1.5 to 2 times speed up.
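A rough sketch of that arithmetic in Python. The 128-bit register width is AltiVec's; the function names are made up for illustration, and the estimate assumes (as the paragraph above does) that the single-precision overhead carries over unchanged to doubles:

```python
REGISTER_BITS = 128  # width of one AltiVec vector register


def lanes(element_bits):
    """How many elements of the given size fit in one vector register."""
    return REGISTER_BITS // element_bits


def expected_speedup(element_bits, single_precision_speedup):
    """Scale an observed single-precision speedup by the ratio of lanes."""
    return single_precision_speedup * lanes(element_bits) / lanes(32)


# Observed single-precision speedups of 3x and 4x, projected onto 64-bit doubles.
for observed in (3.0, 4.0):
    print(observed, "->", expected_speedup(64, observed))
```

With four lanes for floats and two for doubles, the projected gain is simply half the single-precision figure.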

Is that enough to justify massive new hardware on Motorola's or Apple's part?

In my opinion, no.

This is especially true when one notes that using the extra silicon to instead add a second or third scalar FPU could probably do a better job of getting you a full 2x or 3x speed up, and the beauty part of this is that it would require absolutely no recoding for AltiVec. In other words, it would be completely backwards compatible with code written for older machines, give *instant speedups everywhere* and require no developer retraining whatsoever. This would be a good thing.

Even if you still think that SIMD with only two way parallelism is better than two scalar FPU's, you must also consider that double precision is a lot more complicated than single precision. There is no guarantee that pipeline lengths would not be a lot longer. If they were, that 1.5x speed increase might evaporate -- Quickly.

Yes, Intel has SSE2, which has two doubles in a SIMD unit. Yes, it is faster -- for Intel. It makes sense for Intel for a bunch of reasons that have to do with shortcomings in the Pentium architecture and nothing to do with actual advantages with double precision in SIMD.

To begin with, Intel does not have a separate SIMD unit like PowerPC does. If you want to use MMX/SSE/SSE2 on a Pentium, you have to shut down the FPU. That is very expensive to do. As a workaround, Intel has added double precision to its SIMD so that people can do double precision math without having to restart the FPU. You can tell this is what they had in mind because they have a bunch of instructions in SSE2 that only operate on one of the two doubles in the vector. They are in effect using their vector engine as a scalar processing unit to avoid having to switch between the two. Their compilers will even recompile your scalar code to use the vector engine in this way because they avoid the switch penalty.

Okay, so Intel has double precision in their vector unit and despite what I have said, you still think that is absolutely wonderful. But do they really have a double precision vector unit? The answer is not so clear. Their vector unit actually does calculations on the two doubles in the vector in a similar "one at a time" fashion to the way an ordinary scalar unit would. They can only get one vector FP op through [every two cycles] for this reason. AltiVec has no such limitation.

AltiVec can push through one vector FP op per cycle, doing four floating point operations simultaneously (up to 20 in flight concurrently). AltiVec also has a MAF core, which in many cases does two FP operations per instruction. This is the reason why despite large differences in clock frequency, AltiVec can meet and often beat the performance of Intel's vector engine.

The other big dividend that they get from double precision SIMD is the fact that they can get two doubles into one register. When you only have eight registers this is a big deal! [PowerPC has 32 registers for each of scalar floating point and AltiVec!] In 90% of the cases, we programmers don't need more space in there and the registers the PPC provides are just fine.

Simply put (from a developer's position), we just don't need double precision in the vector engine, and we wouldn't derive much benefit from it if we had it. The worst thing that could possibly happen for Mac developers is that we get it, because that would mean that the silicon could not be used to make some other part of the processor faster and more efficient, and a lot of code would need to be rewritten for little to no performance benefit. It wouldn't be a logical tradeoff.

The only way this would be worthwhile would be to double the width of the vector register so that we get 4x parallelism for double precision FP arithmetic.

And with respect to 3D apps *requiring* double precision...

Most 3D rendering apps do not NEED double precision everywhere. They just need it in a few places, and often (if they really decide to look) they may find that there are more robust single-precision algorithms out there that would be just as good. In the end they should be using those algorithms anyway, because the speed benefits for SIMD are twice as good for single precision as they are for double precision.

Apps like that can get a lot more mileage out of the PowerPC if they just increase the amount of parallelism as much as possible in their data processing. Don't just take one square root at a time; do four, etc. And this isn't even taking into account multiprocessing just yet or even AltiVec for that matter. The scalar units alone, by virtue of their pipelines, are capable of doing three to five operations simultaneously! However, if you don't give them 3-5 things to do at every given moment, this power goes unused. Unfortunately, this can be noticed in quite a few Mac applications already on the market where performance doesn't seem to be as solid as it should be. What is baffling is why many Mac developers aren't taking advantage of this power. What it boils down to is that most of these apps just do one thing at a time (for the most part), and in turn are wasting 60-80% of the CPU cycles. That's a lot of waste. What's nice is that the AltiVec unit is also pipelined, so it is important to do a lot in parallel there too. The only problem is that developers actually have to make a conscious effort to use the processor the way it was designed to be used. " - (Anonymous source) <end snip>
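To make "give the pipeline 3-5 things to do" concrete, here is a minimal sketch of the loop shape being described. It is Python only to show the structure; Python itself won't show any speed difference, and the function names are invented for illustration. The idea is that one running total forces every add to wait on the previous one, while four independent totals give a pipelined FPU four adds that don't depend on each other:

```python
def sum_one_chain(values):
    """One accumulator: every add depends on the result of the previous add."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_four_chains(values):
    """Four accumulators: the four adds in each pass are independent of each other."""
    a = b = c = d = 0.0
    n = len(values) - len(values) % 4   # largest multiple of 4 that fits
    for i in range(0, n, 4):
        a += values[i]
        b += values[i + 1]
        c += values[i + 2]
        d += values[i + 3]
    tail = 0.0
    for v in values[n:]:                # leftover elements, if any
        tail += v
    return (a + b) + (c + d) + tail


values = [float(i) for i in range(1, 11)]
print(sum_one_chain(values), sum_four_chains(values))
```

Note that reordering floating point adds can change the last bits of the result for general data, so the two functions are only guaranteed to agree exactly on inputs like the small integers used here.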

So, the rule seems to be that using existing double-precision algorithms is *easier* and not generally *better* than the more robust single-precision algorithms that exist which also produce output quality that is equal to the output that many of you have become accustomed to with double-precision renderers.

What double-precision buys you is an extra 29 bits of precision. 2^29 is about 5*10^8, therefore, double precision can tolerate about half a billion times more accumulated error before it reaches some absolute error threshold beyond which there is *too much error*.

What this means is that *if* within the app, the developers are actually doing 1 million calculations on a single pixel before it reaches the screen, then they probably *need* double-precision. A single precision float is only accurate to about 1 part in 16.7 million. On the other hand, one might ask whether they actually needed to do 1 million calculations on a single pixel in the first place. Anyway these are the things that I'm interested in. Developers should be interested in them too because their customers should be getting the best quality apps for their particular platform of choice. <cont>
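The numbers above can be checked with a few lines of Python. This is only a sketch: Python floats are doubles, so single precision is emulated here by round-tripping through the `struct` module's 4-byte float format, and `to_single` is a helper name made up for this example:

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


SINGLE_BITS = 24   # significand bits in IEEE 754 single precision (implicit bit included)
DOUBLE_BITS = 53   # significand bits in IEEE 754 double precision

extra_bits = DOUBLE_BITS - SINGLE_BITS
print(extra_bits, 2 ** extra_bits)       # the "extra 29 bits" and 2^29
print(2 ** SINGLE_BITS)                  # the "1 part in 16.7 million"

big = float(2 ** SINGLE_BITS)
print(to_single(big + 1.0) == big)       # True: single precision cannot see the +1
print(big + 1.0 == big)                  # False: double precision still can
```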

Ed M.
08-13-2003, 02:58 PM
And there's more... I was speaking with a developer from Germany who suggested possible ways to utilize the AltiVec unit of the *G4* (at the time) in such a way as to solve the double-precision dilemma with respect to 3D. Here is what he had to say:

The 74xx series of PowerPC processors have better scalar FPUs than the MPC75x. The 750's FPU can sustain no more than 3 fp operations per 5 cycles in general, and no more than 3 double precision multiply-adds per 6 cycles. The 7400 improves this to 3 fp ops per 4 cycles even for double precision multiply-adds.

Finally, the 745x pushes the limit to 5 fp ops per 6 cycles, but with a latency of 5 cycles (compared to 3 cycles with the 7400).

With regard to AltiVec, there is some possible speed gain even for double precision number crunching, but that is limited to using the 'data stream prefetch' functionality. In ATLAS (an open source implementation of 'self-tuning' linear algebra kernels), AltiVec prefetching could improve the performance by 10% - 20% in some cases.

Another option, viable at least for some iterative algorithms, is to do the initial calculation steps in single precision with AltiVec, then refine those starting values with scalar code to double precision.
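A minimal sketch of that two-stage idea, using Newton's method for a square root as the iterative algorithm. The choice of algorithm, the step counts, and the helper names are assumptions for illustration, not what the developer had in mind; single precision is emulated by rounding every operation through a 4-byte float:

```python
import math
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def newton_sqrt(a, x, steps, rnd):
    """Run Newton steps x <- (x + a/x) / 2, rounding each operation with rnd."""
    for _ in range(steps):
        x = rnd(0.5 * rnd(x + rnd(a / x)))
    return x


a = 2.0
# Stage 1: the bulk of the iterations in (emulated) single precision.
coarse = newton_sqrt(to_single(a), to_single(a), 5, to_single)
# Stage 2: a couple of double-precision steps to polish the starting value.
fine = newton_sqrt(a, coarse, 2, float)

print(abs(coarse - math.sqrt(a)))   # error of the single-precision stage
print(abs(fine - math.sqrt(a)))     # error after the double-precision polish
```

Because Newton's method roughly squares the error on each step, a starting value that is already good to single precision needs very few double-precision steps to finish the job.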

One more interesting but as yet untested idea was mentioned here quite some time ago; unfortunately I don't remember the inventor. It is based on the observation that many precision problems can be regarded as caused by very long vector dot products with vector elements of wildly varying magnitude. (I don't have a specific reference, but I remember papers about a special purpose FPU which calculated vector dot products with a 2048 bit mantissa to be able to return a bit-precise double precision result.)

The mentioned idea was to use a vector register as a 128 bit wide mantissa for summing up many single precision results. The program code would not be trivial, but AltiVec provides efficient primitives like bit shifts along a vector, so this might indeed be an option.
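Since the idea is untested, the following is only a sketch of the principle and not AltiVec code. A Python integer stands in for the 128-bit register, used as a fixed-point accumulator; the position of the binary point (`FRAC_BITS`) and all function names are assumptions made for this example:

```python
from fractions import Fraction
import struct

ACC_BITS = 128    # width of one AltiVec register
FRAC_BITS = 64    # assumed position of the binary point inside the accumulator


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fixed(x):
    """Exact fixed-point image of a single-precision value; refuses to drop bits."""
    scaled = Fraction(to_single(x)) * (1 << FRAC_BITS)
    if scaled.denominator != 1:
        raise ValueError("value needs more fractional bits than the accumulator has")
    return scaled.numerator


def wide_sum(values):
    """Add single-precision values exactly in a 128-bit fixed-point accumulator."""
    limit = 1 << (ACC_BITS - 1)
    acc = 0
    for v in values:
        acc += to_fixed(v)
        if not -limit <= acc < limit:
            raise OverflowError("sum does not fit in the 128-bit accumulator")
    return acc / (1 << FRAC_BITS)   # rounded once, at the very end


print(wide_sum([16777216.0] + [1.0] * 100))
```

The point of the design is that no rounding happens while summing, only once when the accumulator is converted back to a floating point result. A real AltiVec version would have to do the shifting and carry handling by hand, which is where the non-trivial code comes in.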

Another solution to the same problem is classical: split the list of fp numbers to be added into a positive and a negative group. Sort each group by magnitude. Sum each group starting from small magnitudes, going to large ones (this ensures that the binary point is always as far to the right as possible, providing the maximum possible precision). Finally, subtract the negative partial sum from the positive (this is the only operation which is prone to heavy loss of precision, because if the numbers are equally large, most significant bits of the mantissa cancel out).
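The classical recipe can be sketched like this in Python, with single precision emulated by rounding every add through a 4-byte float. The helper names and the test data are made up for illustration:

```python
import math
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_single(values):
    """Add left to right, rounding to single precision after every add."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total


def sorted_sum_single(values):
    """Split by sign, sum each group from small to large, subtract once at the end."""
    positives = sorted(v for v in values if v >= 0.0)
    negative_magnitudes = sorted(-v for v in values if v < 0.0)
    return to_single(naive_sum_single(positives)
                     - naive_sum_single(negative_magnitudes))


values = [16777216.0] + [1.0] * 100 + [-1.0] * 40
print(naive_sum_single(values))    # loses every one of the +1.0 terms
print(sorted_sum_single(values))   # matches the reference below
print(math.fsum(values))           # correctly rounded double-precision reference
```

In the naive order, each +1.0 is added to 2^24 and rounded away; summing the small terms first lets them build up to something large enough to survive.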


08-13-2003, 03:32 PM
Hi Ed. My understanding of your post is that 3D apps on the Mac should be "single precision" 3D renderers rather than "double precision", as the Mac's Altivec/Velocity engine is more efficient with single precision apps.

Can I assume that Lightwave is a double precision 3D application?

It seems that the Pentium processor is very inefficient with floating-point calculations. So Pentiums do things in a different way.

As you are well aware, Lightwave is created in a "common code base" (which is apparently platform agnostic) from which the Mac and Windows versions of Lightwave are derived. This may make it harder for platform specific changes to be made.

I notice that that other unmentionable recent competitor to Newtek also stated that their application is written in a common code base in the same way. Also, in the old forums I think that Adobe's Chris Cox said that much of Photoshop is written in a common code base, so it seems to be a common thing.

So what is the answer? If single precision is a better thing, why aren't developers using it? Would an application like Lightwave need an extensive rewrite to convert? Would a different renderer be required for Mac and Windows?

Ed M.
08-13-2003, 04:09 PM
Beam, go back and reread the post again -- slower this time.

To answer your first question... No, the G5 added another FPU and should be able to handle double-precision calculations as well as the Pentiums running under SSE2. I don't think you read the points about SSE2 and why Intel uses double-precision within their SIMD engine. You missed a bunch of points. My point as was made clear in the original post was that single-precision *could also* have been used if the time was taken to research better algorithms.

I suspect that Apple and IBM knew that developers were going to continue to be lazy, so on top of the G5's dual bus, they added another FPU per chip. This will provide "instant speedup everywhere" without the need of kludging together a double-precision FPU implementation within a SIMD engine as Intel has done or the need for developers rewriting their apps. If you go back and read, you'll see why Intel included it in their SIMD engine. Anyway...

If developers took the time to implement the more robust single-precision algorithms that exist, then AltiVec would have surely run far ahead of the competition. As it stands now, IBM can always tweak the double-precision performance even further on the G5 by simply adding another FPU which will again show instant speedups without the need of rewriting the code.

Your comment about Photoshop and its "common code-base"...

Keep in mind that everyone has their interpretation of what a "common code base" is. If you also recall, Chris questioned their knowledge of the Mac hardware as well as their claimed "common code base". He couldn't figure out what they were doing wrong and could only make educated guesses. He offered to look at their code to see what could be done. For the record, NewTek never took Chris up on his offer. What would he say about the accuracy of the information contained in my initial post that started off this discussion? He'd likely endorse it 100%. The information is accurate.

So what is the answer? If single precision is a better thing, why aren't developers using it?

Not that it's better, but rather that it could have been an *alternative* for the G4 to bring it more in line performance-wise. I guess it really doesn't matter with the G5 with its extra FPUs. It's likely to contain even more in future releases. Plenty of bandwidth now.

Would a different renderer be required for Mac and Windows?

So, what if a different one *is* required? Shouldn't Mac Lightwavers expect the best possible performance out of the same app PC-users pay for? Aren't they important enough? Some things just need to be more platform-specific. Ted Devlin certainly made that clear when he started the SneakerNet discussions.


08-13-2003, 05:53 PM
Arnie Cachelin (who now works for that other company) once made a comment about the Linux Lightwave renderer.

Only the renderer has so far been ported to Linux. Arnie indicated that porting the renderer was "relatively" easy (compared with the other UI aspects of Lightwave) because there is not much platform specific code in there.

Chuck has hinted that Lightwave 8 (to be released later in the year) will continue with the common code base, but later revisions may have more platform specific optimizations.

By the sounds of things, the Apple G5 will give Newtek the speed increases they want without the need for lots of platform specific code, or a rewrite of the renderer to make it single precision.

Ed M.
08-13-2003, 06:18 PM
By the sounds of things, the Apple G5 will give Newtek the speed increases they want without the need for lots of platform specific code, or a rewrite of the renderer to make it single precision.

That's just it.. It doesn't *have* to be single precision anymore, but that doesn't mean it *shouldn't be* or *couldn't have been*. However, it still doesn't solve the problem of developers not taking full advantage of what the platform has to offer, and paying customers ($$$) deserve the best for their dollars. What's more, there are other things that NewTek could be doing to improve the performance across the board (hello? more parallelism?). That would probably benefit BOTH platforms, but the Mac a whole lot more.

I get the sense from the other thread that NewTek (like other app companies) is content to regurgitate and charge customers for the same old code. I read Chuck's post from the other thread and I'm getting the impression that LW8 is going to be a "hold them over until something else is ready" type app.

I don't know what they have planned for the Mac version, but if I were NewTek I'd be taking advantage of whatever help Apple is willing to provide. And I still can't think of a single Mac-Dev. board where I've seen any of the NewTek guys. And they have to drop OS9 support. Period. It's silly to keep supporting it. The irony is that by the time NewTek ships something that's more platform-specific, the competition will likely have beaten them to it. I have a feeling that there will be one company that will take FULL advantage of Apple's new hardware and OS, maybe two. I'm sure you are thinking of the same two companies that I am.