Floating-Point and Fixed-Point Precision in Computing

1. What are Floating-Point and Fixed-Point Numbers?

1.1 Floating-Point

  • Represents real numbers using scientific notation (mantissa × base^exponent).
  • Can represent a very wide range (tiny to huge), but with limited precision.
  • Used in scientific, engineering, and graphics calculations.
  • Example: 123.45 in base 10 → 1.2345 × 10^2

1.2 Fixed-Point

  • Represents numbers with a fixed number of decimal or binary places.
  • Good for precise, predictable arithmetic, especially in embedded/financial applications.
  • Range and resolution are limited by the number of integer/fractional bits.
  • Example: 123.45 with 2 decimal digits = 12345 (integer) with implicit divisor 100.

2. Floating-Point Representation (IEEE 754 Standard)

2.1 IEEE 754 Single-Precision (32 bits)

Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
  • Value = (-1)^sign × 1.fraction × 2^(exponent-bias)
    • Sign: 0=positive, 1=negative
    • Exponent: Encoded with a bias (127 for 32-bit, 1023 for 64-bit)
    • Fraction (Mantissa): Represents the digits after the binary point.

2.2 Example: Encoding 5.75 in IEEE 754 (32-bit)

  • 5.75 in binary: 101.11 = 1.0111 × 2^2
  • Sign: 0 (positive)
  • Exponent: 2 + 127 = 129 = 1000 0001
  • Fraction: 01110000000000000000000 (drop the leading 1)
  • Bits: 0 | 10000001 | 01110000000000000000000
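The encoding can be verified with a short C program that copies the float's bytes into an integer and unpacks the three fields (a minimal sketch, assuming float is IEEE 754 binary32, as on virtually all current platforms):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 5.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the float's bytes */
    uint32_t sign     = bits >> 31;           /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;      /* 23 bits */
    printf("bits = 0x%08X\n", (unsigned)bits);   /* expect 0x40B80000 */
    printf("sign=%u exponent=%u (actual %d) fraction=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)fraction);
    return 0;
}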

2.3 Normalization

  • Normalized numbers: the significand always has the form 1.fraction, so the leading 1 is implicit and not stored (zero and denormals are the exceptions).
  • Ensures unique representation and maximum precision.
  • Example: 0.25 = 1.0 × 2^-2 (normalized: mantissa 1.0, exponent -2)
  • Denormalized (subnormal) numbers: exponent field all zeros and no implicit leading 1; used to represent numbers very close to zero.
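A quick C check of the normalized/subnormal boundary (a sketch assuming IEEE 754 binary32; FLT_MIN is the smallest positive normalized float, about 1.18e-38, while the smallest subnormal is 2^-149, about 1.4e-45):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    printf("smallest normalized float: %g\n", FLT_MIN);                 /* ~1.17549e-38 */
    printf("smallest subnormal float:  %g\n", nextafterf(0.0f, 1.0f));  /* ~1.4e-45 */
    printf("FLT_MIN / 2 (subnormal):   %g\n", FLT_MIN / 2);             /* gradual underflow */
    return 0;
}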

2.4 Exponent Bias

  • Bias allows exponent to represent both positive and negative powers of two.
  • For single-precision, bias = 127.
    • Exponent field 0 → zero and subnormals (subnormals use an effective exponent of -126 with no implicit leading 1)
    • Exponent field 255 → special values (Inf, NaN)
  • Actual Exponent = Exponent field - Bias

2.5 Example: -0.15625 Representation

  • Magnitude in binary: 0.15625 = 0.00101 = 1.01 × 2^-3
  • Sign = 1 (negative)
  • Exponent = -3 + 127 = 124 = 0111 1100
  • Mantissa = 010000… (fill with zeros to 23 bits)
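  • Bits: 1 | 01111100 | 01000000000000000000000 (hex 0xBE200000)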

3. Fixed-Point Representation

3.1 Structure

  • Store integer value; scale by implicit or explicit factor.
  • Qm.n notation: m integer bits, n fractional bits.
    • Q8.8: 8 integer bits, 8 fractional bits, total 16 bits.
    • Value = Integer Representation × 2^(-n)
  • Example: To store 5.75 in Q8.8:
    • 5.75 × 2^8 = 1472
    • Store as integer 1472 (0x05C0)
    • Read back as 1472 / 256 = 5.75
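A minimal Q8.8 helper in C illustrating the conversion above (the type and function names are illustrative, not a standard API):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q8_8;                             /* 16-bit value scaled by 2^-8 */

static q8_8   q8_8_from_double(double x) { return (q8_8)(x * 256.0); }  /* truncates toward zero */
static double q8_8_to_double(q8_8 x)     { return x / 256.0; }

int main(void) {
    q8_8 a = q8_8_from_double(5.75);              /* stored as 1472 (0x05C0) */
    printf("raw = %d (0x%04X), value = %g\n", a, (unsigned)(uint16_t)a, q8_8_to_double(a));
    return 0;
}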

3.2 Example: 16-bit Q7.8 format (range -128 to +127.996, resolution ≈0.004)

  • Storing 3.125: 3.125 × 256 = 800 → binary 0000 0011 0010 0000
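Multiplication in fixed point needs an extra shift to remove the doubled scale factor, as in this Q7.8 sketch (the helper name q7_8_mul is illustrative):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q7_8;                      /* sign + 7 integer bits + 8 fractional bits */

static q7_8 q7_8_mul(q7_8 a, q7_8 b) {
    /* Widen to 32 bits, multiply, then shift right by 8 to drop the extra 2^8 scale.
       (Right-shifting negative values is implementation-defined in C; real code may
       prefer dividing by 256 or rounding before the shift.) */
    return (q7_8)(((int32_t)a * (int32_t)b) >> 8);
}

int main(void) {
    q7_8 x = (q7_8)(3.125 * 256);          /* 800 = 0x0320, as above */
    q7_8 y = (q7_8)(2.0 * 256);            /* 512 */
    q7_8 p = q7_8_mul(x, y);               /* expect 1600, i.e. 6.25 */
    printf("x = %d, y = %d, x*y = %d (%.4f)\n", x, y, p, p / 256.0);
    return 0;
}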

4. Precision and Error Analysis

4.1 Floating-Point Precision Errors

  • Not all decimal fractions can be represented exactly in binary (e.g., 0.1 is a repeating binary fraction).
  • Rounding errors accumulate during computation; subtracting nearly equal values causes catastrophic cancellation.
  • Machine epsilon: the gap between 1.0 and the next representable value (≈1.19e-7 for float, ≈2.22e-16 for double).
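The epsilon values above come straight from <float.h>, and the rounding effect is easy to demonstrate (a small sketch; the exact printed digits may vary slightly by platform):

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);     /* ~1.19209e-07 */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);     /* ~2.22045e-16 */
    float one_plus = 1.0f + FLT_EPSILON / 2;       /* halfway case: rounds to even, back to 1.0f */
    printf("1.0f + FLT_EPSILON/2 == 1.0f ? %d\n", one_plus == 1.0f);   /* prints 1 */
    return 0;
}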

4.2 Example: Adding 0.1 Ten Times in C

#include <stdio.h>
int main() {
    float x = 0.0;
    for (int i = 0; i < 10; ++i) x += 0.1f;
    printf("%.20f\n", x); // typically prints 1.00000011920928955078, not 1.0
    return 0;
}
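For comparison, the same loop with double typically sums to about 0.9999999999999999: the error is much smaller, but it does not disappear.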

4.3 Fixed-Point Errors

  • Errors due to limited resolution (quantization).
  • Overflow if result exceeds range.
  • More predictable, but range and step size are fixed by design.

4.4 Example: Fixed-Point Overflow

  • Suppose Q8.8 max value ≈ 127.996
  • Adding 80 + 80 (as Q8.8: 80×256 = 20480) → 20480 + 20480 = 40960, which does not fit in a signed 16-bit word; it wraps to -24576, i.e. -96.0 in Q8.8
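A small C illustration of this wrap-around (the q8_8 typedef is illustrative; converting an out-of-range value to a signed 16-bit type is implementation-defined in C, but wraps on typical two's-complement targets):

#include <stdio.h>
#include <stdint.h>

typedef int16_t q8_8;                      /* Q8.8: implicit divisor 256 */

int main(void) {
    q8_8 a = (q8_8)(80 * 256);             /* 80.0 -> 20480 */
    q8_8 b = (q8_8)(80 * 256);
    /* The int addition 20480 + 20480 = 40960 does not fit in 16 bits;
       storing it back into q8_8 typically wraps to -24576, i.e. -96.0. */
    q8_8 sum = (q8_8)(a + b);
    printf("raw = %d, value = %.3f\n", sum, sum / 256.0);
    return 0;
}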

5. Which Is Best?

5.1 Floating-Point

  • Best for wide dynamic range and scientific calculations.
  • Hardware support on most CPUs (FPU).
  • Subject to rounding errors, non-associativity (order of operations matters).
  • Standardized (IEEE 754) and portable.

5.2 Fixed-Point

  • Best for embedded systems, DSP, and financial calculations where range and precision are known and must be predictable.
  • Faster on hardware without FPU; less power/area.
  • Precision and overflow/underflow must be managed carefully.

5.3 Normalization Trade-Offs

  • Normalization in floating-point maximizes precision, but can introduce underflow/overflow for numbers outside representable range.
  • Denormalized numbers allow gradual underflow but reduce precision.

6. Summary Table

Type        | Range                  | Precision             | Speed           | Error/Limitations                               | Hardware Support
Float       | ~1.2e-38 to ~3.4e38    | ≈7 decimal digits     | Fast (with FPU) | Rounding; cannot represent all decimals exactly | Most CPUs/GPUs
Double      | ~2.2e-308 to ~1.8e308  | ≈15–16 decimal digits | Slower          | Smaller rounding error, same limitations        | Most CPUs/GPUs
Fixed-point | Configurable           | Configurable          | Fast            | Limited range, quantization, overflow           | All MCUs, FPGAs