Floating-Point & Fixed-Point Representation
Floating-Point and Fixed-Point Precision in Computing
1. What are Floating-Point and Fixed-Point Numbers?
1.1 Floating-Point
- Represents real numbers using scientific notation (mantissa × base^exponent).
- Can represent a very wide range (tiny to huge), but with limited precision.
- Used in scientific, engineering, and graphics calculations.
- Example: 123.45 in base 10 → 1.2345 × 10^2
1.2 Fixed-Point
- Represents numbers with a fixed number of decimal or binary places.
- Good for precise, predictable arithmetic, especially in embedded/financial applications.
- Range and resolution are limited by the number of integer/fractional bits.
- Example: 123.45 with 2 decimal digits = 12345 (integer) with implicit divisor 100.
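The same idea in C, as a minimal sketch (the variable name is illustrative): 123.45 is held as an integer count of hundredths with an implicit divisor of 100.
#include <stdio.h>

int main(void) {
    /* 123.45 stored as an integer number of hundredths (implicit divisor 100) */
    long cents = 12345;
    printf("%ld.%02ld\n", cents / 100, cents % 100);  /* prints 123.45 */
    return 0;
}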
2. Floating-Point Representation (IEEE 754 Standard)
2.1 IEEE 754 Single-Precision (32 bits)
Sign | Exponent | Fraction |
---|---|---|
1 bit | 8 bits | 23 bits |
- Value = (-1)^sign × 1.fraction × 2^(exponent-bias)
- Sign: 0=positive, 1=negative
- Exponent: Encoded with a bias (127 for 32-bit, 1023 for 64-bit)
- Fraction (Mantissa): Represents the digits after the binary point.
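To make the field layout concrete, here is a minimal C sketch that extracts the three fields from a float's bit pattern with memcpy (3.5 = 1.75 × 2^1 is used as an arbitrary test value):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 3.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits */
    uint32_t sign     = bits >> 31;           /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;      /* 23 bits after the binary point */
    printf("sign=%u exponent=%u (actual %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    /* prints: sign=0 exponent=128 (actual 1) fraction=0x600000 */
    return 0;
}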
2.2 Example: Encoding 5.75 in IEEE 754 (32-bit)
- 5.75 in binary: 101.11 = 1.0111 × 2^2
- Sign: 0 (positive)
- Exponent: 2 + 127 = 129 = 1000 0001
- Fraction: 01110000000000000000000 (drop the leading 1)
- Bits: 0 | 10000001 | 01110000000000000000000
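Reinterpreting 5.75f as an integer verifies this bit string: 0 10000001 01110000000000000000000 is 0x40B80000 in hex (a quick sketch):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 5.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("0x%08X\n", bits);  /* prints 0x40B80000 */
    return 0;
}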
2.3 Normalization
- Normalized numbers: The most significant bit of the mantissa is always 1 (except for denormals/zero), so it is left implicit and not stored.
- Ensures unique representation and maximum precision.
- Example: 0.25 = 1.0 × 2^-2 (normalized: mantissa 1.0, exponent -2)
- Denormalized: Used to represent numbers very close to zero (subnormal numbers); the exponent field is all zeros and there is no implicit leading 1.
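The constants in <float.h> show the boundary (a small sketch; FLT_TRUE_MIN, the smallest subnormal, was added in C11):
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("smallest normalized float: %e\n", FLT_MIN);       /* ~1.18e-38 */
#ifdef FLT_TRUE_MIN
    printf("smallest subnormal float:  %e\n", FLT_TRUE_MIN);  /* ~1.40e-45 */
#endif
    return 0;
}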
2.4 Exponent Bias
- Bias allows exponent to represent both positive and negative powers of two.
- For single-precision, bias = 127.
- Exponent field 0 → zero and denormals (effective exponent 1 - bias = -126, not -127)
- Exponent field 255 → special values (Inf, NaN)
- Actual Exponent = Exponent field - Bias
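The all-ones exponent field can be demonstrated by building the bit patterns directly (a sketch assuming IEEE 754 floats):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t inf_bits  = 0x7F800000;  /* exponent field 255, fraction 0  -> +Inf */
    uint32_t qnan_bits = 0x7FC00000;  /* exponent field 255, fraction != 0 -> NaN */
    float inf, qnan;
    memcpy(&inf, &inf_bits, sizeof inf);
    memcpy(&qnan, &qnan_bits, sizeof qnan);
    printf("%f %f\n", inf, qnan);  /* typically prints: inf nan */
    return 0;
}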
2.5 Example: -0.15625 Representation
- 0.15625 in binary: 0.00101 = 1.01 × 2^-3
- Sign = 1 (negative)
- Exponent = -3 + 127 = 124 = 0111 1100
- Mantissa = 010000… (fill with zeros to 23 bits)
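Working the encoding backwards confirms the example; this sketch rebuilds the value from the three fields with ldexpf (link with -lm):
#include <stdio.h>
#include <math.h>

int main(void) {
    /* fields from the example: sign=1, exponent field=124, fraction=0x200000 */
    int sign = 1;
    int exponent = 124 - 127;                       /* actual exponent: -3 */
    float mantissa = 1.0f + 0x200000 / 8388608.0f;  /* 1 + fraction/2^23 = 1.25 */
    float value = (sign ? -1.0f : 1.0f) * ldexpf(mantissa, exponent);
    printf("%f\n", value);  /* prints -0.156250 */
    return 0;
}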
3. Fixed-Point Representation
3.1 Structure
- Store integer value; scale by implicit or explicit factor.
- Qm.n notation: m integer bits, n fractional bits.
- Q8.8: 8 integer bits, 8 fractional bits, total 16 bits.
- Value = Integer Representation × 2^(-n)
- Example: To store 5.75 in Q8.8:
- 5.75 × 2^8 = 1472
- Store as integer 1472 (0x05C0)
- Read back as 1472 / 256 = 5.75
- Storing 3.125: 3.125 × 256 = 800 → binary 0000 0011 0010 0000
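A minimal Q8.8 sketch in C (the helper names to_q88/from_q88 are illustrative, not a standard API); note that multiplication needs a wider intermediate and a shift back by n = 8 fractional bits:
#include <stdio.h>
#include <stdint.h>

typedef int16_t q88_t;  /* signed Q8.8: 8 integer bits (incl. sign) + 8 fractional bits */

static q88_t to_q88(double x)   { return (q88_t)(x * 256.0); }
static double from_q88(q88_t q) { return q / 256.0; }

int main(void) {
    q88_t a = to_q88(5.75);   /* 1472 = 0x05C0 */
    q88_t b = to_q88(3.125);  /*  800 = 0x0320 */
    /* multiply in 32 bits, then drop the extra 8 fractional bits */
    q88_t prod = (q88_t)(((int32_t)a * b) >> 8);
    printf("%f * %f = %f\n", from_q88(a), from_q88(b), from_q88(prod));
    /* prints: 5.750000 * 3.125000 = 17.968750 */
    return 0;
}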
4. Precision and Error Analysis
4.1 Floating-Point Precision Errors
- Not all decimal fractions can be represented exactly in binary (e.g., 0.1 is a repeating binary fraction).
- Rounding errors accumulate across computations, and subtracting nearly equal values causes catastrophic cancellation.
- Machine epsilon: the gap between 1.0 and the next representable number (≈1.19e-7 for float, ≈2.22e-16 for double).
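Both epsilons are available in <float.h>; a short sketch:
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("float epsilon:  %e\n", FLT_EPSILON);  /* ~1.19e-7  (2^-23) */
    printf("double epsilon: %e\n", DBL_EPSILON);  /* ~2.22e-16 (2^-52) */
    printf("%.9f\n", 1.0f + FLT_EPSILON);         /* next float above 1.0 */
    return 0;
}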
4.2 Example: Adding 0.1 Ten Times in C
#include <stdio.h>

int main(void) {
    float x = 0.0f;
    for (int i = 0; i < 10; ++i) x += 0.1f;  /* 0.1 has no exact binary form */
    printf("%.20f\n", x);  /* prints 1.00000011920928955078, not 1.0 */
    return 0;
}
4.3 Fixed-Point Errors
- Errors due to limited resolution (quantization).
- Overflow if result exceeds range.
- More predictable, but range and step size are fixed by design.
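For example, 0.1 falls between two Q8.8 steps of size 2^-8 = 0.00390625, so storing it rounds to the nearest step (a sketch):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* round 0.1 to the nearest Q8.8 value: 0.1 * 256 = 25.6 -> 26 */
    int16_t q = (int16_t)(0.1 * 256.0 + 0.5);
    printf("stored %d -> %f (error %f)\n", q, q / 256.0, q / 256.0 - 0.1);
    /* prints: stored 26 -> ~0.101563 (error ~0.001563) */
    return 0;
}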
4.4 Example: Fixed-Point Overflow
- Suppose Q8.8 max value ≈ 127.996
- Adding 80 + 80 (as Q8.8: 80×256 = 20480) → 20480 + 20480 = 40960, which wraps in a signed 16-bit register to -24576 and reads back as -24576/256 = -96.0, as the sketch below shows
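The same wraparound reproduced in C (a sketch assuming the usual two's-complement behavior when converting back to int16_t):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t a = 80 * 256;            /* 20480 = 80.0 in Q8.8 */
    int16_t b = 80 * 256;
    int16_t sum = (int16_t)(a + b);  /* 40960 wraps to -24576 */
    printf("%d -> %f\n", sum, sum / 256.0);  /* prints: -24576 -> -96.000000 */
    return 0;
}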
5. Which Is Best?
5.1 Floating-Point
- Best for wide dynamic range and scientific calculations.
- Hardware support on most CPUs (FPU).
- Subject to rounding errors and to non-associativity: the order of operations changes the result, as the sketch after this list shows.
- Standardized (IEEE 754) and portable.
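A classic demonstration of that non-associativity (a sketch; the exact outputs assume IEEE 754 single precision):
#include <stdio.h>

int main(void) {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c);  /* 0 + 1 -> 1.000000 */
    printf("a + (b + c) = %f\n", a + (b + c));  /* c is absorbed -> 0.000000 */
    return 0;
}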
5.2 Fixed-Point
- Best for embedded systems, DSP, and financial calculations where range and precision are known and must be predictable.
- Faster on hardware without FPU; less power/area.
- Precision and overflow/underflow must be managed carefully.
5.3 Normalization Trade-Offs
- Normalization in floating-point maximizes precision, but numbers outside the representable range still overflow or underflow.
- Denormalized numbers allow gradual underflow but reduce precision.
6. Summary Table
Type | Range | Precision | Speed | Error/Limitations | Hardware Support |
---|---|---|---|---|---|
Float | ≈1e-38 to 1e38 | ≈7 decimal digits | Fast (with FPU) | Rounding; cannot represent all decimals exactly | Most CPUs/GPUs |
Double | ≈1e-308 to 1e308 | ≈15-16 decimal digits | Slower than float | Same rounding issues, smaller error | Most CPUs/GPUs |
Fixed-Point | Configurable (Qm.n) | Configurable (step 2^-n) | Fast (integer ALU) | Limited range, quantization, overflow | All MCUs, FPGAs |