# How to correctly normalize a floating point value in C++?

Maybe I don't understand the IEEE 754 standard that well, but suppose you have a set of floating point values of type float or double, for example:

56.543f 3238.124124f 121.3f ...

You can convert them into values ranging from 0 to 1, i.e. normalize them, by picking an appropriate common factor based on the maximum and the minimum value in the set.

Now my point is that in this transformation I need much higher precision for the destination set, which ranges from 0 to 1, than I need for the source set, especially if the values in the source set cover a wide numerical range (really big and really small values).

How can the float or double type (or the IEEE 754 standard, if you want) handle this situation and provide more precision for the second set of values, given that I basically don't need an integer part?

Or does it not handle this at all, so that I need fixed point math with a totally different type?

## Answers

Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation to the top of the significand. Each value is carried with the same number of binary digits of precision relative to its own magnitude.
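To make "the same number of binary digits relative to its own magnitude" concrete, here is a minimal sketch that measures the gap between a float and its next representable neighbour using `std::nextafter`. The helper names `ulp_at` and `relative_ulp` are invented for this illustration and assume a finite, positive, normal input.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it (one ulp at x).
// Assumes x is finite and positive.
float ulp_at(float x) {
    assert(std::isfinite(x) && x > 0.0f);
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// The same gap divided by x. For normal floats this stays between
// epsilon / 2 and epsilon, whatever the magnitude of x.
float relative_ulp(float x) {
    return ulp_at(x) / x;
}
```

The absolute gap grows with the value (it is larger at 3238.124124f than at 56.543f), but the relative gap stays within a factor of two of `std::numeric_limits<float>::epsilon()`.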

When you compress your set of floating point values to the range 0..1, the only precision loss you will get will be due to the rounding that occurs in the various steps of the process.

If you're merely compressing by scaling, you will lose only a small amount of precision near the LSBs of the mantissa (around 1 or 2 ulp, where ulp means "units in the last place").
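One way to check this on your own data is to compare a single-precision division against a double-precision reference. This is only a sketch: `scaling_error_in_ulps` is a hypothetical helper, and it assumes positive, finite inputs with `x <= max_value`.

```cpp
#include <cassert>
#include <cmath>

// Error of the float quotient x / max_value, expressed in float ulps of the
// result and measured against a double-precision reference quotient.
// Assumes 0 < x <= max_value and both values finite.
double scaling_error_in_ulps(float x, float max_value) {
    assert(x > 0.0f && x <= max_value && std::isfinite(max_value));
    const float scaled = x / max_value;  // a single rounding happens here
    const double reference =
        static_cast<double>(x) / static_cast<double>(max_value);
    // Spacing of floats just above the scaled result.
    const float one_ulp = std::nextafter(scaled, 2.0f) - scaled;
    return std::fabs(static_cast<double>(scaled) - reference) / one_ulp;
}
```

For a single division the measured error should not exceed one ulp; chaining several operations is what pushes it toward the "1 or 2 ulp" figure.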

If you also need to shift your data, then things get trickier. If your data is all positive, then subtracting off the smallest number will not damage anything. But, if your data is a mixture of positive and negative data, then some of your values near zero may suffer a loss in precision.

If you do all the arithmetic at double precision, you'll carry 53 bits of precision through the calculation. If your precision needs fit within that (which likely they do), then you'll be fine. Otherwise, the exact numerical performance will depend on the distribution of your data.

Single and double IEEE floats have a format in which the exponent and fraction fields have fixed bit widths. So it is not possible to trade exponent bits for extra fraction bits (i.e. you will always have unused bits if you only store values between 0 and 1). (See: http://en.wikipedia.org/wiki/Single-precision_floating-point_format)

Are you sure the 52-bit wide fraction part of a double is not precise enough?

**Edit:** If you use the whole range of the floating point format, you will lose precision when normalizing the values. The roundings can be off, and sufficiently small values will become 0. Unless you know that this is a problem, don't worry. Otherwise you have to look for some other solution, as mentioned in other answers.
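The "small values become 0" case can be reproduced with a sketch like the following. The helper `vanishes_when_scaled` is hypothetical, and the example values are chosen only because their quotient falls below the smallest subnormal float.

```cpp
#include <cassert>
#include <cmath>

// True when a non-zero x underflows to exactly zero after dividing by
// max_value in type T. Assumes finite inputs and max_value > 0.
template <typename T>
bool vanishes_when_scaled(T x, T max_value) {
    assert(std::isfinite(x) && std::isfinite(max_value) && max_value > T(0));
    const T scaled = x / max_value;
    return x != T(0) && scaled == T(0);
}
```

With floats, `vanishes_when_scaled(1e-30f, 3e38f)` is true because the quotient (about 3e-69) is below the float range, while the same values as doubles survive.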

If you have a selection of doubles and you normalize them to between 0.0 and 1.0, there are a number of sources of precision loss. They are all, however, much smaller than you suspect.

First, you will lose some precision in the arithmetic operations required to normalize them as rounding occurs. This is relatively small -- a bit or so per operation -- and usually relatively random.

Second, the exponent field will no longer use its positive values, since every normalized value is at most 1.

Third, as all the values are positive, the sign bit will also be wasted.

Fourth, if the input space does not include +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.

But, for the most part, you'll waste about 3 bits of information in a 64 bit double in your normalization, one of which is the kind of thing that is nearly unavoidable when you deal with finite-bit-width values.

Any 64 bit fixed point representation of the values from 0 to 1 will have far less "range" than doubles. A double can represent something on the order of 10^-300, while a 64 bit fixed point representation that includes 1.0 can only go as low as 10^-19 or so. (The 64 bit fixed point representation can represent 1 - 10^-19 as being distinct from 1, while the double cannot, but the 64 bit fixed point value cannot represent anything smaller than 2^-64, while doubles can.)
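The trade-off can be checked numerically. The sketch below assumes a hypothetical unsigned fixed point format with 64 fractional bits, so its step is 2^-64; `gap_below_one` and `fixed_point_step` are names invented for the illustration.

```cpp
#include <cassert>
#include <cmath>

// Distance from 1.0 down to the largest double strictly below it (2^-53).
double gap_below_one() {
    return 1.0 - std::nextafter(1.0, 0.0);
}

// Step size of a hypothetical fixed point format with the given number of
// fractional bits, returned as a double (2^-fractional_bits).
double fixed_point_step(int fractional_bits) {
    assert(fractional_bits > 0);
    return std::ldexp(1.0, -fractional_bits);
}
```

Near 1.0 the fixed point format wins: `1.0 - fixed_point_step(64)` rounds back to `1.0` in double arithmetic. Near 0 the double wins: `1e-300` is a valid non-zero double that is far smaller than `fixed_point_step(64)`.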

Some of the numbers above are approximate, and may depend on rounding/exact format.

Having binary floating point values (with an implicit leading one) expressed as

(1+fraction) * 2^exponent where fraction < 1

A division a/b is:

a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b))

Hence division/multiplication has essentially no loss of precision.

A subtraction a-b is:

a-b = (1+fraction(a)) * 2^exponent(a) - (1+fraction(b)) * 2^exponent(b)

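Hence a subtraction/addition might lose precision, because the operand with the smaller exponent has to be aligned to the larger one before the result is rounded (big - tiny == big)!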
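The `big - tiny == big` effect is easy to reproduce. In this minimal sketch, `is_absorbed` is a hypothetical helper and the inputs are assumed to be finite.

```cpp
#include <cassert>
#include <cmath>

// True when subtracting tiny from big gives back big unchanged, i.e. tiny
// is smaller than half the spacing of representable values around big.
// Assumes both inputs are finite.
template <typename T>
bool is_absorbed(T big, T tiny) {
    assert(std::isfinite(big) && std::isfinite(tiny));
    const T difference = big - tiny;
    return difference == big;
}
```

As floats, `is_absorbed(1.0e8f, 1.0f)` is true because neighbouring floats near 1e8 are 8 apart. As doubles, the same values are not absorbed, but `is_absorbed(1.0e17, 1.0)` is.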

Mapping a value x from a range [min, max] to [0, 1] with

(x - min) / (max - min)

will have precision issues if any subtraction has a loss of precision.
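A minimal sketch of that formula, doing the arithmetic in double even though the inputs are floats. The function name `normalize` is hypothetical, and the code assumes a non-empty input of finite values that are not all equal.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Maps every value to [0, 1] using (x - min) / (max - min).
// The subtraction and division are carried out in double precision.
std::vector<double> normalize(const std::vector<float>& values) {
    assert(!values.empty());
    const auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    assert(hi > lo);  // otherwise the range is zero and the division fails
    const double range = hi - lo;

    std::vector<double> result;
    result.reserve(values.size());
    for (float v : values) {
        result.push_back((v - lo) / range);
    }
    return result;
}
```

For the values from the question, `normalize({56.543f, 3238.124124f, 121.3f})` maps the smallest value to exactly 0.0 and the largest to exactly 1.0, with 121.3f landing in between.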

Answering your question: nothing does this for you automatically. Choose a suitable representation (floating point, fraction, multi-precision, ...) for your algorithms and expected data.

For higher precision you can try Boost.Multiprecision: http://www.boost.org/doc/libs/1_55_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html

Note also that for the numerically critical operations + and - there are special algorithms that minimize the accumulated rounding error:

http://en.wikipedia.org/wiki/Kahan_summation_algorithm
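
A minimal sketch of the compensated summation described in the linked article, next to a plain loop for comparison. The function names are my own, the inputs are assumed finite, and the code relies on the compiler not reassociating floating point arithmetic (so no `-ffast-math` or similar).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain left-to-right summation in single precision.
float naive_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum;
}

// Kahan (compensated) summation: `compensation` carries the low-order bits
// that were rounded away when the previous term was added to `sum`.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float compensation = 0.0f;
    for (float v : values) {
        assert(std::isfinite(v));
        const float y = v - compensation;  // re-inject the lost bits
        const float t = sum + y;           // low bits of y may be lost here
        compensation = (t - sum) - y;      // recover what was lost
        sum = t;
    }
    return sum;
}
```

For example, summing 1.0e8f followed by a thousand 1.0f terms leaves the naive loop stuck at 1.0e8f, because each 1.0f is absorbed, while the compensated version accumulates the small terms and reaches 100001000.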