Floating-Point and Fixed-Point Precision in Computing

1. What are Floating-Point and Fixed-Point Numbers?

1.1 Floating-Point

  • Represents real numbers using scientific notation (mantissa × base^exponent).
  • Can represent a very wide range (tiny to huge), but with limited precision.
  • Used in scientific, engineering, and graphics calculations.
  • Example: 123.45 in base 10 → 1.2345 × 10^2

1.2 Fixed-Point

  • Represents numbers with a fixed number of decimal or binary places.
  • Good for precise, predictable arithmetic, especially in embedded/financial applications.
  • Range and resolution are limited by the number of integer/fractional bits.
  • Example: 123.45 with 2 decimal digits = 12345 (integer) with implicit divisor 100.

2. Floating-Point Representation (IEEE 754 Standard)

2.1 IEEE 754 Single-Precision (32 bits)

Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
  • Value = (-1)^sign × 1.fraction × 2^(exponent-bias)
    • Sign: 0=positive, 1=negative
    • Exponent: Encoded with a bias (127 for 32-bit, 1023 for 64-bit)
    • Fraction (Mantissa): Represents the digits after the binary point.

2.2 Example: Encoding 5.75 in IEEE 754 (32-bit)

  • 5.75 in binary: 101.11 = 1.0111 × 2^2
  • Sign: 0 (positive)
  • Exponent: 2 + 127 = 129 = 1000 0001
  • Fraction: 01110000000000000000000 (drop the leading 1)
  • Bits: 0 | 10000001 | 01110000000000000000000
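The encoding can be verified with a short C program that copies the float's bytes into an integer and unpacks the three fields (a minimal sketch, assuming float is IEEE 754 binary32, as on virtually all current platforms):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 5.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the float's bytes */
    uint32_t sign     = bits >> 31;           /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;      /* 23 bits */
    printf("bits = 0x%08X\n", (unsigned)bits);   /* expect 0x40B80000 */
    printf("sign=%u exponent=%u (actual %d) fraction=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)fraction);
    return 0;
}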

2.3 Normalization

  • Normalized numbers: the significand always has the form 1.fraction, so the leading 1 is implicit and not stored (zero and denormals are the exceptions).
  • Ensures unique representation and maximum precision.
  • Example: 0.25 = 1.0 × 2^-2 (normalized: mantissa 1.0, exponent -2)
  • Denormalized (subnormal) numbers: exponent field all zeros and no implicit leading 1; used to represent numbers very close to zero.
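A quick C check of the normalized/subnormal boundary (a sketch assuming IEEE 754 binary32; FLT_MIN is the smallest positive normalized float, about 1.18e-38, while the smallest subnormal is 2^-149, about 1.4e-45):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    printf("smallest normalized float: %g\n", FLT_MIN);                 /* ~1.17549e-38 */
    printf("smallest subnormal float:  %g\n", nextafterf(0.0f, 1.0f));  /* ~1.4e-45 */
    printf("FLT_MIN / 2 (subnormal):   %g\n", FLT_MIN / 2);             /* gradual underflow */
    return 0;
}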

2.4 Exponent Bias

  • Bias allows exponent to represent both positive and negative powers of two.
  • For single-precision, bias = 127.
    • Exponent field 0 → zero and subnormals (subnormals use an effective exponent of -126 with no implicit leading 1)
    • Exponent field 255 → special values (Inf, NaN)
  • Actual Exponent = Exponent field - Bias

2.5 Example: -0.15625 Representation

  • Magnitude in binary: 0.15625 = 0.00101 = 1.01 × 2^-3
  • Sign = 1 (negative)
  • Exponent = -3 + 127 = 124 = 0111 1100
  • Mantissa = 010000… (fill with zeros to 23 bits)
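  • Bits: 1 | 01111100 | 01000000000000000000000 (hex 0xBE200000)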

3. Fixed-Point Representation

3.1 Structure

  • Store integer value; scale by implicit or explicit factor.
  • Qm.n notation: m integer bits, n fractional bits.
    • Q8.8: 8 integer bits, 8 fractional bits, total 16 bits.
    • Value = Integer Representation × 2^(-n)
  • Example: To store 5.75 in Q8.8:
    • 5.75 × 2^8 = 1472
    • Store as integer 1472 (0x05C0)
    • Read back as 1472 / 256 = 5.75
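A minimal Q8.8 helper in C illustrating the conversion above (the type and function names are illustrative, not a standard API):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q8_8;                             /* 16-bit value scaled by 2^-8 */

static q8_8   q8_8_from_double(double x) { return (q8_8)(x * 256.0); }  /* truncates toward zero */
static double q8_8_to_double(q8_8 x)     { return x / 256.0; }

int main(void) {
    q8_8 a = q8_8_from_double(5.75);              /* stored as 1472 (0x05C0) */
    printf("raw = %d (0x%04X), value = %g\n", a, (unsigned)(uint16_t)a, q8_8_to_double(a));
    return 0;
}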

3.2 Example: 16-bit Q7.8 format (range -128 to +127.996, resolution ≈0.004)

  • Storing 3.125: 3.125 × 256 = 800 → binary 0000 0011 0010 0000
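Multiplication in fixed point needs an extra shift to remove the doubled scale factor, as in this Q7.8 sketch (the helper name q7_8_mul is illustrative):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q7_8;                      /* sign + 7 integer bits + 8 fractional bits */

static q7_8 q7_8_mul(q7_8 a, q7_8 b) {
    /* Widen to 32 bits, multiply, then shift right by 8 to drop the extra 2^8 scale.
       (Right-shifting negative values is implementation-defined in C; real code may
       prefer dividing by 256 or rounding before the shift.) */
    return (q7_8)(((int32_t)a * (int32_t)b) >> 8);
}

int main(void) {
    q7_8 x = (q7_8)(3.125 * 256);          /* 800 = 0x0320, as above */
    q7_8 y = (q7_8)(2.0 * 256);            /* 512 */
    q7_8 p = q7_8_mul(x, y);               /* expect 1600, i.e. 6.25 */
    printf("x = %d, y = %d, x*y = %d (%.4f)\n", x, y, p, p / 256.0);
    return 0;
}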

4. Precision and Error Analysis

4.1 Floating-Point Precision Errors

  • Not all decimal fractions can be represented exactly in binary (e.g., 0.1 is a repeating binary fraction).
  • Rounding errors accumulate during computation; subtracting nearly equal values causes catastrophic cancellation.
  • Machine epsilon: the gap between 1.0 and the next representable value (≈1.19e-7 for float, ≈2.22e-16 for double).
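The epsilon values above come straight from <float.h>, and the rounding effect is easy to demonstrate (a small sketch; the exact printed digits may vary slightly by platform):

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);     /* ~1.19209e-07 */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);     /* ~2.22045e-16 */
    float one_plus = 1.0f + FLT_EPSILON / 2;       /* halfway case: rounds to even, back to 1.0f */
    printf("1.0f + FLT_EPSILON/2 == 1.0f ? %d\n", one_plus == 1.0f);   /* prints 1 */
    return 0;
}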

4.2 Example: Adding 0.1 Ten Times in C

#include <stdio.h>
int main() {
    float x = 0.0;
    for (int i = 0; i < 10; ++i) x += 0.1f;
    printf("%.20f\n", x); // typically prints 1.00000011920928955078, not 1.0
    return 0;
}
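For comparison, the same loop with double typically sums to about 0.9999999999999999: the error is much smaller, but it does not disappear.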

4.3 Fixed-Point Errors

  • Errors due to limited resolution (quantization).
  • Overflow if result exceeds range.
  • More predictable, but range and step size are fixed by design.

4.4 Example: Fixed-Point Overflow

  • Suppose Q8.8 max value ≈ 127.996
  • Adding 80 + 80 (as Q8.8: 80×256 = 20480) → 20480 + 20480 = 40960, which does not fit in a signed 16-bit word; it wraps to -24576, i.e. -96.0 in Q8.8
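A small C illustration of this wrap-around (the q8_8 typedef is illustrative; converting an out-of-range value to a signed 16-bit type is implementation-defined in C, but wraps on typical two's-complement targets):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q8_8;                      /* Q8.8: implicit divisor 256 */

int main(void) {
    q8_8 a = (q8_8)(80 * 256);             /* 80.0 -> 20480 */
    q8_8 b = (q8_8)(80 * 256);
    /* The int addition 20480 + 20480 = 40960 does not fit in 16 bits;
       storing it back into q8_8 typically wraps to -24576, i.e. -96.0. */
    q8_8 sum = (q8_8)(a + b);
    printf("raw = %d, value = %.3f\n", sum, sum / 256.0);
    return 0;
}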

5. Which Is Best?

5.1 Floating-Point

  • Best for wide dynamic range and scientific calculations.
  • Hardware support on most CPUs (FPU).
  • Subject to rounding errors, non-associativity (order of operations matters).
  • Standardized (IEEE 754) and portable.

5.2 Fixed-Point

  • Best for embedded systems, DSP, and financial calculations where range and precision are known and must be predictable.
  • Faster on hardware without FPU; less power/area.
  • Precision and overflow/underflow must be managed carefully.

5.3 Normalization Trade-Offs

  • Normalization in floating-point maximizes precision, but can introduce underflow/overflow for numbers outside representable range.
  • Denormalized numbers allow gradual underflow but reduce precision.

6. Summary Table

Type        | Range                  | Precision             | Speed           | Error/Limitations                               | Hardware Support
Float       | ~1.2e-38 to ~3.4e38    | ≈7 decimal digits     | Fast (with FPU) | Rounding; cannot represent all decimals exactly | Most CPUs/GPUs
Double      | ~2.2e-308 to ~1.8e308  | ≈15–16 decimal digits | Slower          | Smaller rounding error, same limitations        | Most CPUs/GPUs
Fixed-point | Configurable           | Configurable          | Fast            | Limited range, quantization, overflow           | All MCUs, FPGAs