# Floating point numbers - what else can be done?

In a recent article here in The Register we saw some of the problems that result when floating point numbers are misused or chosen inappropriately. Many people wrote in to say they had seen first-hand some of the voodoo techniques we decried, so clearly we're in the midst of a numerical calculation crisis and, if we don't do …

#### Back to the Past (or Future)

The more things change, the more they stay the same. Back in the day (1965 is pretty close) the BCD-based IBM 1620 offered optional hardware floating point, and if your machine had it, Fortran would use it: up to 28 digits were supported. So we've come full circle: a BCD, base-10 floating point format endorsed by the IEEE. Some things never change!

#### Is this really a problem?

i can't get past the thought that these are solutions in search of a problem - following on from comments on the previous article about banks and financial institutions not using floating point numbers.

at work we use what i guess would be called scaled integers for our calculations - and these are calculations on VERY large sums of money for clients who would be very (sorry, VERY) annoyed if rounding errors started creeping into interest calculations etc. a small error on a calculation involving a few hundred million pounds can still be pretty big!

e.g. for a calculation involving 1.3 and 63.215 those numbers would get nowhere near our systems.

they would be stored as:

13 * 10^-1 and 63215 * 10^-3

that's a 13 and a -1, and a 63215 and a -3

then it's relatively straightforward mathematics involving the numbers and the powers

Addition: 1.3 + 63.215

equates to

63215 + (13 * 10^(-1 - -3)) to get the numbers on the same 'scale'

= 63215 + (13 * 10^2)

= 63215 + 1300

= 64515

remembering to reapply the initial power, this is 64515 * 10^-3 (64.515), so 64515 and -3 are stored for the result.

Multiplication: 1.3 * 63.215

Multiplying the numbers: 63215 * 13 = 821795

Calculating the correct power of the result: 10^(-3 + -1)

= 821795 * 10^-4 (or 82.1795)

so the result stored would be 821795 and -4

.....with no loss of precision.
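The scheme above can be sketched in a few lines of Python (an illustration of the technique only; the class name and layout are my own invention, not anyone's production system):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaledInt:
    """A number stored as mantissa * 10**exponent, e.g. 63.215 -> (63215, -3)."""
    mantissa: int
    exponent: int

    def __add__(self, other):
        # Rescale to the smaller exponent so both mantissas share a scale.
        e = min(self.exponent, other.exponent)
        a = self.mantissa * 10 ** (self.exponent - e)
        b = other.mantissa * 10 ** (other.exponent - e)
        return ScaledInt(a + b, e)

    def __mul__(self, other):
        # Multiply mantissas and add exponents; nothing is rounded away.
        return ScaledInt(self.mantissa * other.mantissa,
                         self.exponent + other.exponent)

print(ScaledInt(13, -1) + ScaledInt(63215, -3))  # ScaledInt(mantissa=64515, exponent=-3), i.e. 64.515
print(ScaledInt(13, -1) * ScaledInt(63215, -3))  # ScaledInt(mantissa=821795, exponent=-4), i.e. 82.1795
```

Python's unbounded integers hide the one real caveat: on fixed-width hardware you must also watch the mantissa for overflow.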

is this not a case of a lack of knowledge of a solution that has been used for decades rather than a need for a new solution? (if you want to get something done properly, ask a bloke who has worked on a mainframe)

#### In support of fixed point

Handling this kind of problem was one of the design goals of the RTL/2 language developed at ICI back in the early '70s; as you can imagine, it helps to get your numbers right when you're controlling valves, meters, etc. in chemical plants and refineries.

RTL/2 has a binary fraction fixed point type as standard which provides an easy route to scaling values that fall within a known range. Combined with explicit rules about how expressions are evaluated (e.g. on when double-length intermediate types are used, or on how to combine different types) it provides the basic tools to manage numerical issues of this sort across a wide range of applications.
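As a rough sketch of what a binary-fraction type buys you (my own Python illustration of the idea, not RTL/2 itself): a value known to lie in [0, 1) is held as an integer count of 2^-16 steps, so representable values and their sums stay exact.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS            # 1.0 as a 16-bit binary fraction

def to_frac(x):
    """Quantize a value from the known range [0, 1) to 2**-16 steps (rounds once, here)."""
    return round(x * ONE)

def frac_to_float(f):
    return f / ONE

# Binary-friendly values such as 0.25 and 0.5 are held with no error at all,
# and adding two fixed-point values is plain integer addition, hence exact:
assert frac_to_float(to_frac(0.5)) == 0.5
assert frac_to_float(to_frac(0.25) + to_frac(0.5)) == 0.75
```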

Of course you do still need to know what you're doing...

As an aside, the RTL/2 user manual is the only one I've ever come across which introduces floating point numbers and types at the start but waits until over half way through before covering integers.

#### Re: Is this really a problem?

The Java class java.math.BigDecimal takes care of all the rescaling for you. It allows explicit rescaling and power-of-ten calculations too; best of all, it lets you pick one of eight rounding modes for divisions and power-of-ten scale changes. Additional features, like an optional precision (MathContext), were added to java.math.BigDecimal in Java 1.5 and above.

The Big in the BigDecimal name is intentional: a java.math.BigDecimal instance holds a java.math.BigInteger instance (a potentially very large binary number container, formed from an array of 32-bit values), so free memory is the only real restriction on number resolution, and fast 64-bit long-integer maths steps are used instead of slow BCD maths steps.

Note: BigDecimal and BigInteger are immutable; any calculation makes new instances of these objects. There is a MutableBigInteger, but Sun annoyingly did not make it a public class!
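For readers without a JVM to hand, Python's standard decimal module is a close analogue (context precision, explicit rescaling via quantize, selectable rounding modes), so the BigDecimal behaviour described above can be sketched like this:

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 34                    # an optional precision, like a MathContext
a = Decimal("1.3") * Decimal("63.215")
print(a)                                  # 82.1795, exact
# Explicit rescaling under a chosen rounding mode, like setScale(2, RoundingMode.HALF_UP):
print(a.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))   # 82.18
```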

#### Floats aren't the problem

Why are floats getting a bad rap? Every data type has its limits, and requires that you know how to use it properly to avoid errors. Other data types are better at certain types of calculations, but only if the programmer knows when to use them, and what the differences are between them.

Integers are great for counting things, but don't kid yourself that they solve your rounding problem in any way. They only make it much worse and totally out of your control. And if you move the decimal over enough that the always-truncate rounding method doesn't seem like such a problem, you had better watch your upper bound if you want to do any real math (like multiplying two values together).
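A small illustration of both hazards, with made-up figures: integer division always discards the remainder, and multiplying two scaled values needs an intermediate of roughly twice the magnitude.

```python
# 5% interest on 1234 pence, integer-only: the fractional penny simply vanishes,
# and always by truncation -- the rounding direction is not yours to choose.
interest = 1234 * 5 // 100
print(interest)                     # 61, not 61.7

# Multiplying two values scaled by 10**6 needs an intermediate around 10**13:
a, b = 2_500_000, 3_000_000         # 2.5 and 3.0, each scaled by 10**6
product = a * b // 1_000_000        # rescale after multiplying
print(product)                      # 7500000, i.e. 7.5 -- fine in Python, but a*b
                                    # has already overflowed any 32-bit integer
```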

BCD can help avoid rounding when it seems unnecessary in decimal, but even ignoring the computational overhead, it is easy to contrive situations where BCD rounding is significantly worse than float.

Data types are just tools. Only a very poor craftsman would use a screwdriver to drive a nail, blame his hammer for not being good at cutting boards, or round a float and crash his satellite.

#### Not an issue at all

If the programmer is aware of the issues and how IEEE mathematics -- if FULLY implemented (including both round and guard bits) -- works then the precision problems described in the article are very rarely the problem. Almost always the problems are due to programmers coding things up to do the math in a way that inherently loses precision. Too few programmers are fully trained in numerical analysis and the ways to avoid these pitfalls.

If you need more than double-double precision (sometimes referred to as quad precision) then you fall into one of two camps: 1) you are dealing with very large or very small numbers and thus overflow or underflow the exponent range defined in double-double (unfortunately no larger than double's); or 2) you don't know what you're doing.

Real-world measurements are extremely rarely good to more than 5 or 6 decimal digits. To expect more than those same 5 or 6 decimal digits out of the final answer is delusional. Often real-world measurements are no more than 3 or 4 decimal digits, yielding that same 3 or 4 decimal digits of meaningful information in the answer.

Under proper usage double-double can give you more than 30 decimal digits. If your software is causing you to round off more than 24 or 25 digits then you've set up the methodology of your software wrong.

I've used BCD machines before. Burroughs produced one which would do as much precision as you wanted (in FORTRAN IV). Want 100 digits of decimal precision? It was there, if you were willing to take the performance hit. It was used primarily for (and owned by) a financial institution, but I could get free time on it in the evenings for my scientific work.

When I converted everything over to binary systems (using TOPS-20 on a DECSystem-20 and quad precision), it came over with extremely little impact on precision. After tweaking the code to be in line with DEC's version of FORTRAN-77, all precision impacts were eliminated.

Except in very rare cases, I can't help but believe other situations are the same.

#### Interval Arithmetic

One of the common uses of floating point arithmetic is in modelling real-world phenomena. The techniques described in the article are, I believe, inadequate to handle some of the issues that crop up in trying to solve these models. For example: we often have measurement error; know physical constants to varying degrees of accuracy; discretize data that is supposed to represent a continuum; use algorithms that may not be numerically stable over the entire domain; or deal with problems that are inherently stiff. The real world is nonlinear.

There is an alternative to using fixed point schemes (whether integers, scaled integers, rationals, or floating point numbers as approximations to a continuum), which is to compute with sets of numbers. For reasons of efficiency, and to take advantage of hardware acceleration, we generally use intervals, defined as the set of all numbers between a lower bound and an upper bound, [a,b].

By using intervals we can represent measurement error, floating point rounding error, and imprecise constants in a unified and consistent way. For some of us, one of the greatest strengths of this approach is that when you compute something, you also obtain an indication of the quality of the answer, i.e. the width of the interval.

Some problems have traditionally been considered intractable, which may no longer be the case when using IA. Consider a (large) solution space. By eliminating boxes (multidimensional intervals) where it can be proved the solution cannot be, you can iterate towards more and more accurate approximations of the solution, subject to the precision of the arithmetic being used. As the size of the boxes shrinks, switch to higher precision if required. A classic example of this technique is an Interval Newton method for finding all roots of a function.

For all of this to work, you do need an implementation of interval arithmetic, one that guarantees containment of the true solution for all operator-operand combinations. In my own work, I use the implementation that is part of Sun's Fortran compilers.
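A minimal Python sketch of the idea (illustrative only, not Sun's implementation; a real library rounds the lower bound down and the upper bound up on every operation, gestured at here with math.nextafter from Python 3.9):

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, o):
        # Round outward so the true sum is always contained.
        return Interval(math.nextafter(self.lo + o.lo, -math.inf),
                        math.nextafter(self.hi + o.hi, math.inf))

    def __mul__(self, o):
        products = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def width(self):
        # The quality indicator: how tight is our enclosure of the true answer?
        return self.hi - self.lo

# A measurement of about 63.215 with +/-0.001 error, times roughly 1.3:
x = Interval(63.214, 63.216)
y = Interval(1.299, 1.301)
z = x * y
print(z, z.width())      # the result is guaranteed to enclose the true product
```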

#### Use Rationals

The objection to using rationals is overstated. It is not difficult to simply allow the numerator and denominator to grow in size as needed. In the days when the cost of bytes of memory was measured in dollars, this idea wouldn't have worked. However, we now have plenty of memory for doing this.

In financial calculations, the size wouldn't be significant. It could become significant in certain scientific calculations in which accuracy is vital. Of course, if accuracy is vital, then it wouldn't do to use floating point numbers anyway.
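This grow-as-needed scheme is exactly what Python's fractions.Fraction provides: numerator and denominator are arbitrary-precision integers, limited only by memory.

```python
from fractions import Fraction

# Exactly what binary floats cannot do:
total = Fraction(1, 10) + Fraction(2, 10)
print(total)                          # 3/10, exactly; compare 0.1 + 0.2 != 0.3
assert 0.1 + 0.2 != 0.3

# The cost is growth: summing 1/2 + 1/3 + ... + 1/50 exactly
s = sum(Fraction(1, n) for n in range(2, 51))
print(s.denominator)                  # a large integer; memory is the only limit
```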

In my opinion, floating point calculations have long since outlived their usefulness.

#### Arbitrary-precision reals

There has been some work on using continued fractions or unbounded sequences of bits/digits to represent real numbers with arbitrary precision. You normally use a redundant representation to avoid having to normalize after each operation.

There is a limit to what operations you can do -- for example, you can never determine that two numbers are equal. For different numbers, an equality test will eventually find the difference and return false, but if numbers are equal, it will just go on forever looking for a difference. So, instead, you have to ask "can you find a difference within epsilon?".

There is extensive theory of computable reals, and many implementations exist. Using these is, however, very time-consuming.
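A toy sketch of the epsilon test (my own construction, not any standard library): represent a real as a function that, given n, returns a rational within 2^-n of the true value; "agree to within eps" then becomes decidable even though exact equality is not.

```python
from fractions import Fraction

def within(a, b, eps):
    """Decide whether the reals behind a and b lie within eps of each other.

    a(n) and b(n) must return Fractions no further than 2**-n from the true values.
    """
    n = 0
    while Fraction(1, 2 ** n) > eps / 4:   # refine until approximation error is tiny
        n += 1
    # |true_a - true_b| <= |a(n) - b(n)| + 2 * 2**-n, so one comparison suffices:
    return abs(a(n) - b(n)) <= eps / 2

third_exact = lambda n: Fraction(1, 3)
third_trunc = lambda n: Fraction(10 ** n // 3, 10 ** n)   # error < 10**-n <= 2**-n
half = lambda n: Fraction(1, 2)

print(within(third_exact, third_trunc, Fraction(1, 1000)))   # True
print(within(third_exact, half, Fraction(1, 1000)))          # False
```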

But normal floats are not bad in themselves -- the problem is when they are used by people who are not aware of their limitations, or simply choose to ignore the issue.

#### Even bankers make mistakes

I do find it odd when people say that problems with using floating point numbers are overstated because they're supposedly never used where they shouldn't be in financial and banking applications. Well, I used to work in banking and I suspect some of these problems are only weeded out in testing - if you're lucky. A bad programmer can do a lot of damage to a financial system before he/she is weeded out.

On the use of scaled integers to avoid the problem, I wonder if there's another hidden issue here? What if the scaling factor is different in different applications - and then you try to integrate the data using the wrong scaling factor? I wonder how long it would take to notice? That scaling factor must be an important piece of metadata to manage...
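The hazard is easy to demonstrate with a toy example: the stored integer is meaningless without its scale, so the same raw value read back under a wrongly assumed scale is silently off by a power of ten.

```python
stored = 64515                       # written under a scale of 10**-3, i.e. 64.515

def read_amount(raw, assumed_exponent):
    """Interpret a raw scaled integer under whatever scale the reader assumes."""
    return raw / 10 ** -assumed_exponent   # assumed_exponent is negative here

print(read_amount(stored, -3))       # 64.515  -- the writer's scale: correct
print(read_amount(stored, -2))       # 645.15  -- a plausible wrong guess: 10x off,
                                     # and nothing in the data itself flags it
```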

#### Read the words of the Guru

For forthright opinions on the pros and cons of any architecture for computation the only place to go is

http://www.cs.berkeley.edu/~wkahan

For example, see

http://www.cs.berkeley.edu/~wkahan/ARITH_17.pdf

"Why is Decimal floating-point hardware a good idea anyway?

Because it can help our industry avoid Errors Designed Not To Be Found."

"The essence of civilization is that we benefit from

the experience of others without having to relive it."

#### Fixed-point library for C++

In case anyone's interested, I coded a fixed-point math template library for C++. You can find out how to get it at my personal web site, www.guntheroth.com, but basically it involves emailing me. My fixed-point templates are faster than scalar floating point on a Pentium 4. They are likely to be faster than a rational-number library or a BigInt-based library too, and in most production systems speed turns out to be important.

Unfortunately, fixed-point won't magically make rounding errors go away. Addition and subtraction are always exact in fixed-point, and so is multiplication by an integer. There is a decimal fixed-point type with the sometimes-desirable property of exactly representing dollars-and-cents values, whereas many decimal fractions start off with no exact representation in binary floating-point.

But general multiplication and division involve rounding. You can't escape this; there is just no cure. There are unrepresentable numbers (like the value of 1/3 on binary computers). There are irrational numbers (like pi or e). Some calculations come out perfectly exact using the right representation, but you have to inspect each computation. There are also methods of managing the error in inexact computations, but they require that you stayed awake during your very tedious Numerical Analysis class.
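For instance, in a Q16.16 binary fixed-point layout (a common convention; this sketch is mine, not the template library being described), the product of two values carries 32 fraction bits that must be rounded back to 16, and 1/3 is not representable to begin with:

```python
FRAC = 16                                # Q16.16: 16 integer bits, 16 fraction bits

def fx(x):
    """Encode a float as Q16.16 (this rounds once already)."""
    return round(x * (1 << FRAC))

def fx_mul(a, b):
    # The raw product has 32 fraction bits; round-half-up back down to 16.
    return (a * b + (1 << (FRAC - 1))) >> FRAC

print(fx(1 / 3) / (1 << FRAC))           # 0.3333282470703125: 1/3 was never exact
assert fx_mul(fx(1.5), fx(2.5)) == fx(3.75)          # this one happens to be exact
print(fx_mul(fx(1 / 3), fx(3.0)) / (1 << FRAC))      # 0.9999847412109375, not 1.0
```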

Sorry. No cure but hard work and close attention. As usual.