API
Scalar Functions
- gfloat.decode_float(fi, i)
Given a FormatInfo and an integer code point, decode to a FloatValue.
- Parameters:
fi (FormatInfo) – Floating point format descriptor.
i (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) = fi.k
- Returns:
Decoded float value
- Raises:
ValueError – If i is outside the range of valid code points in fi.
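The decoding step described above can be sketched for a toy 8-bit format. This does not use gfloat: `decode_toy` and its defaults (1 sign bit, 4 exponent bits, 3 trailing significand bits, bias 7, no Inf or NaN encodings) are assumptions chosen for illustration, and it returns a plain Python float rather than a FloatValue.

```python
import math

def decode_toy(i, k=8, precision=4, bias=7):
    """Decode code point i of a toy sign/exponent/significand format.

    Hypothetical helper, not part of gfloat. Every code point is treated
    as a finite value (no Inf or NaN encodings).
    """
    if not 0 <= i < 2**k:
        raise ValueError(f"code point {i} is outside [0, 2**{k})")
    t = precision - 1                      # trailing significand bits
    w = k - 1 - t                          # exponent bits
    sign = -1.0 if (i >> (k - 1)) & 1 else 1.0
    exp = (i >> t) & ((1 << w) - 1)        # raw (biased) exponent field
    sig = i & ((1 << t) - 1)               # trailing significand field
    if exp == 0:
        # Subnormal range: no implicit leading bit.
        return sign * math.ldexp(sig, 1 - bias - t)
    # Normal range: implicit leading bit is added to the significand.
    return sign * math.ldexp((1 << t) + sig, exp - bias - t)
```

Under these assumptions, `decode_toy(0x38)` is 1.0 and `decode_toy(256)` raises ValueError, mirroring the range check documented for the code point.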
- gfloat.round_float(fi, v, rnd=RoundMode.TiesToEven, sat=False)
Round input to the given FormatInfo, given rounding mode and saturation flag.
An input NaN will convert to a NaN in the target. An input Infinity will convert to the largest float if sat, otherwise to an Inf, if present, otherwise to a NaN. Negative zero will be returned if the format has negative zero, otherwise zero.
- Parameters:
- Returns:
A float which is one of the values in the format.
- Raises:
ValueError – The target format cannot represent the input (e.g. converting a NaN when the target has no NaN, or an Inf when the target has neither NaN nor Inf and sat is false)
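The special-value rules described for round_float can be summarised as a small decision function. This is a sketch of the documented policy only, not the library's code: `convert_special` and its parameters (`fmax`, `has_infs`, `has_nans`) are hypothetical names, and finite inputs are passed through unrounded.

```python
import math

def convert_special(v, fmax, has_infs, has_nans, sat):
    """Apply the documented NaN/Inf policy; finite values pass through.

    Hypothetical helper: fmax is the largest finite value of the target,
    has_infs / has_nans say whether the target encodes Inf / NaN.
    """
    if math.isnan(v):
        if not has_nans:
            raise ValueError("target format cannot represent NaN")
        return math.nan
    if math.isinf(v):
        if sat:
            return math.copysign(fmax, v)   # saturate to largest float
        if has_infs:
            return v                        # keep the infinity
        if has_nans:
            return math.nan                 # fall back to NaN
        raise ValueError("target format has neither Inf nor NaN")
    return v  # ordinary rounding of finite values is not shown here
```

For example, with a target whose largest finite value is 448.0 and which has NaN but no Inf, an infinite input saturates to 448.0 when `sat` is true and becomes NaN otherwise.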
- gfloat.encode_float(fi, v)
Encode input to the given FormatInfo.
Will round toward zero if v is not in the value set. Will saturate to Inf, NaN, fi.max in order of precedence. Encode -0 to 0 if not fi.has_nz.
For other roundings and saturations, call round_float() first.
- Parameters:
fi (FormatInfo) – Describes the target format
v (float) – The value to be encoded.
- Returns:
The integer code point
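Round-toward-zero encoding with saturation can be illustrated by brute force on a toy 8-bit format. The helpers below are hypothetical and do not call gfloat; the toy format (1 sign bit, 4 exponent bits, 3 trailing bits, bias 7) is assumed to have negative zero but no Inf or NaN, so out-of-range inputs saturate to its largest finite value.

```python
import math

def toy_magnitude(code):
    """Magnitude of a 7-bit toy code (4 exponent bits, 3 trailing bits, bias 7)."""
    exp, sig = code >> 3, code & 0b111
    if exp == 0:
        return math.ldexp(sig, -9)           # zero and subnormals
    return math.ldexp(8 + sig, exp - 10)     # implicit leading bit

def encode_toy(v):
    """Encode v by rounding toward zero, saturating at the largest value."""
    if math.isnan(v):
        raise ValueError("the toy format has no NaN")
    sign = 1 if math.copysign(1.0, v) < 0 else 0
    # Magnitudes increase with the code, so the largest code whose magnitude
    # does not exceed |v| is the round-toward-zero result.
    code = max(c for c in range(128) if toy_magnitude(c) <= abs(v))
    return (sign << 7) | code
```

Here 1.1 lies between the representable values 1.0 and 1.125, so `encode_toy(1.1)` gives the code for 1.0; a different rounding would need to be applied to the value before encoding.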
Block format functions
- gfloat.decode_block(fi, block)
Decode a block of integer codepoints in Block Format fi.
The scale is encoded in the first value of block, with the remaining values encoding the block elements.
The size of the iterable is not checked against the format descriptor.
- Parameters:
fi (BlockFormatInfo) – Describes the block format
block (Iterable[int]) – Input block
- Returns:
A sequence of floats representing the encoded values.
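The block layout described above (scale first, elements after) can be sketched without gfloat. The function and the two decoders are hypothetical stand-ins: the scale is assumed to be a power of two stored with an offset of 127, the elements are assumed to be small integers, and each decoded element is assumed to be multiplied by the scale.

```python
def decode_block_sketch(block, decode_scale, decode_elem):
    """Decode an iterable whose first code is the scale, the rest elements."""
    codes = iter(block)
    scale = decode_scale(next(codes))          # first value carries the scale
    # As in the description above, the block length is not checked.
    return [scale * decode_elem(c) for c in codes]

def pow2_scale(code):
    """Hypothetical scale decoder: 2**(code - 127)."""
    return 2.0 ** (code - 127)

def int_elem(code):
    """Hypothetical element decoder: the code is the value."""
    return float(code)

values = decode_block_sketch([128, 1, 2, 3], pow2_scale, int_elem)
```

With these stand-ins the scale code 128 decodes to 2.0, so the three element codes decode to 2.0, 4.0 and 6.0.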
- gfloat.encode_block(fi, scale, vals, round=RoundMode.TiesToEven)
Encode float vals into the block format described by fi.
The scale is explicitly passed, and the vals are assumed to already be multiplied by 1/scale. That is, this is pure encoding; scaling is computed and applied elsewhere (see e.g. quantize_block()).
The scale is checked for overflow in the target scale format, and an exception is raised if it overflows.
- Parameters:
- Returns:
A sequence of ints representing the encoded values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
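A minimal sketch of "pure encoding" with a scale overflow check, again without gfloat. The names are hypothetical; the scale encoding is assumed to hold powers of two from 2**-127 to 2**127 as codes 0 to 254, and the element encoder simply rounds to the nearest integer.

```python
import math

def encode_pow2_scale(scale):
    """Hypothetical scale encoder: powers of two stored with offset 127."""
    if not (scale > 0 and math.isfinite(scale)):
        raise ValueError("scale must be positive and finite")
    mantissa, exponent = math.frexp(scale)     # scale = mantissa * 2**exponent
    if mantissa != 0.5:
        raise ValueError("scale is not a power of two")
    code = (exponent - 1) + 127
    if not 0 <= code <= 254:
        raise ValueError("scale overflows the scale encoding")
    return code

def encode_block_sketch(scale, vals, encode_scale, encode_elem):
    """Emit the scale code first, then one code per pre-scaled value."""
    return [encode_scale(scale)] + [encode_elem(v) for v in vals]

# vals are assumed to have been divided by the scale already.
codes = encode_block_sketch(2.0, [1.0, 2.0, 3.0], encode_pow2_scale, round)
```

Passing a scale such as 2.0**200 to this sketch raises ValueError, which corresponds to the overflow condition listed under Raises.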
- gfloat.quantize_block(fi, vals, compute_scale, round=RoundMode.TiesToEven)
Encode and decode a block of vals into the block format described by fi.
- Parameters:
fi (BlockFormatInfo) – Describes the target block format
vals (numpy.array) – Input block
compute_scale ((float, ArrayLike) -> float) – Callable to compute the scale, defaults to compute_scale_amax()
round (RoundMode) – Rounding mode to use, defaults to TiesToEven
- Returns:
An array of floats representing the quantized values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
- gfloat.compute_scale_amax(emax, vals)
Compute a scale factor such that vals can be scaled to the range [0, 2**emax]. That is, scale is computed such that the largest exponent in the array vals * scale will be emax.
The scale is clipped to the range 2**[-127, 127].
If all values are zero, any scale value smaller than emax would be accurate, but returning the smallest possible means that quick checks on the magnitude to identify near-zero blocks will also find the all-zero blocks.
- Parameters:
- Returns:
A float such that vals * scale has exponents less than or equal to emax.
Note
If all vals are zero, 1.0 is returned.
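The scale computation can be sketched as follows. This reads the description literally (the scale multiplies vals so that the largest exponent becomes emax) and returns 1.0 for an all-zero block as stated in the Note; `scale_amax_sketch` is a hypothetical name, and whether the library applies the scale or its reciprocal is not established here.

```python
import math

def scale_amax_sketch(emax, vals):
    """Power-of-two scale so that max(|vals|) * scale has exponent emax."""
    amax = max((abs(v) for v in vals), default=0.0)
    if amax == 0.0:
        return 1.0                      # all-zero block, following the Note
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    amax_exp = math.frexp(amax)[1] - 1
    # Clip the exponent of the scale to [-127, 127].
    log2_scale = max(-127, min(127, emax - amax_exp))
    return 2.0 ** log2_scale

scale = scale_amax_sketch(8, [0.5, 3.0, -6.0])
```

In this example the largest magnitude is 6.0, whose exponent is 2, so the scale is 2**6 and the scaled maximum 384.0 has exponent 8.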
Classes
- class gfloat.FormatInfo
Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.
- name
Short name for the format, e.g. binary32, bfloat16
- k
Number of bits in the format
- precision
Number of significand bits (including implicit leading bit)
- emax
Largest exponent, emax, which shall equal floor(log_2(maxFinite))
- has_nz
Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0
- has_infs
Set if format includes +/- Infinity. If set, the non-nan value with the highest encoding for each sign (s) is replaced by (s)Inf.
- num_high_nans
Number of NaNs that are encoded in the highest encodings for each sign
- has_subnormals
Set if format encodes subnormals
- is_signed
Set if the format has a sign bit
- is_twos_complement
Set if the format uses two’s complement encoding for the significand
- property tSignificandBits
The number of trailing significand bits, t
- property expBits
The number of exponent bits, w
- property signBits
The number of sign bits, s
- property expBias
The exponent bias derived from (p,emax).
This is the bias that should be applied so that \(\lfloor \log_2(\text{maxFinite}) \rfloor = emax\)
- property bits
The number of bits occupied by the type.
- property eps
The difference between 1.0 and the next representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.
- property epsneg
The difference between 1.0 and the next representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.
- property iexp
The number of bits in the exponent portion of the floating point representation.
- property machep
The exponent that yields eps.
- property max
The largest representable number.
- property maxexp
The smallest positive power of the base (2) that causes overflow.
- property min
The smallest representable number, typically -max.
- property num_nans
The number of code points which decode to NaN
- property code_of_nan
Return a codepoint for a NaN
- property code_of_posinf
Return a codepoint for positive infinity
- property code_of_neginf
Return a codepoint for negative infinity
- property code_of_zero
Return a codepoint for (non-negative) zero
- property has_zero
Does the format have zero?
This is false if the mantissa has zero width and the format has no subnormals, in which case the mantissa is always decoded as 1. If a format with a zero-width mantissa has subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.
- property code_of_negzero
Return a codepoint for negative zero
- property code_of_max
Return a codepoint for fi.max
- property code_of_min
Return a codepoint for fi.min
- property smallest_normal
The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.
- property smallest_subnormal
The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.
- property smallest
The smallest positive floating point number.
- property is_all_subnormal
Are all encoded values subnormal?
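The eps and epsneg figures quoted above for IEEE-754 binary64 can be checked with the standard library alone, since a Python float is a binary64 value on common platforms. This sketch does not use FormatInfo; it assumes Python 3.9 or later for `math.nextafter`.

```python
import math
import sys

precision = sys.float_info.mant_dig              # 53 significand bits for binary64

# Gap from 1.0 up to the next representable float.
eps = math.nextafter(1.0, math.inf) - 1.0
# Gap from 1.0 down to the previous representable float.
epsneg = 1.0 - math.nextafter(1.0, -math.inf)

print(eps == 2.0 ** (1 - precision))             # eps is 2**-(precision - 1)
print(epsneg == 2.0 ** (-precision))             # epsneg is half of eps
```

The gap below 1.0 is half the gap above it because 1.0 is the smallest value in its binade, so the floats just below it are spaced twice as densely.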
- class gfloat.FloatClass
Enum for the classification of a FloatValue.
- NORMAL = 1
A positive or negative normalized non-zero value
- SUBNORMAL = 2
A positive or negative subnormal value
- ZERO = 3
A positive or negative zero value
- INFINITE = 4
A positive or negative infinity (+/-Inf)
- NAN = 5
Not a Number (NaN)
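The five classes can be illustrated by classifying ordinary Python floats (binary64). The enum below is a local stand-in that mirrors the member names and values listed above; it is not imported from gfloat, and `classify` is a hypothetical helper.

```python
import enum
import math
import sys

class FloatClassSketch(enum.Enum):
    """Local stand-in mirroring the documented members."""
    NORMAL = 1
    SUBNORMAL = 2
    ZERO = 3
    INFINITE = 4
    NAN = 5

def classify(v):
    """Classify a Python float according to the five documented classes."""
    if math.isnan(v):
        return FloatClassSketch.NAN
    if math.isinf(v):
        return FloatClassSketch.INFINITE
    if v == 0.0:
        return FloatClassSketch.ZERO            # covers both +0.0 and -0.0
    if abs(v) < sys.float_info.min:             # below the smallest normal
        return FloatClassSketch.SUBNORMAL
    return FloatClassSketch.NORMAL
```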
- class gfloat.RoundMode
Enum for IEEE-754 rounding modes.
Result r is obtained from input v depending on rounding mode as follows
- TowardZero = 1
\(\max \{ r ~ s.t. ~ |r| \le |v| \}\)
- TowardNegative = 2
\(\max \{ r ~ s.t. ~ r \le v \}\)
- TowardPositive = 3
\(\min \{ r ~ s.t. ~ r \ge v \}\)
- TiesToEven = 4
Round to nearest, ties to even
- TiesToAway = 5
Round to nearest, ties away from zero
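The difference between the five modes is easiest to see when the set of representable values is the integers. The enum is a local stand-in with the documented names, and `round_to_int` is a hypothetical helper; rounding to a real floating-point format follows the same rules on a non-uniform grid.

```python
import enum
import math

class RoundModeSketch(enum.Enum):
    """Local stand-in mirroring the documented members."""
    TowardZero = 1
    TowardNegative = 2
    TowardPositive = 3
    TiesToEven = 4
    TiesToAway = 5

def round_to_int(v, rnd):
    """Round a finite float v to an integer under the given mode."""
    if rnd is RoundModeSketch.TowardZero:
        return math.trunc(v)
    if rnd is RoundModeSketch.TowardNegative:
        return math.floor(v)
    if rnd is RoundModeSketch.TowardPositive:
        return math.ceil(v)
    if rnd is RoundModeSketch.TiesToEven:
        return round(v)                  # Python's round() breaks ties to even
    if rnd is RoundModeSketch.TiesToAway:
        t = math.trunc(v)
        if abs(v - t) >= 0.5:            # at or past the midpoint: step away from zero
            t += 1 if v > 0 else -1
        return t
    raise ValueError(f"unknown rounding mode: {rnd!r}")

results = {m.name: round_to_int(2.5, m) for m in RoundModeSketch}
```

For the tie 2.5, the two nearest-rounding modes disagree: TiesToEven gives 2 and TiesToAway gives 3.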
- class gfloat.FloatValue
A floating-point value decoded in great detail.
- ival
Integer code point
- fval
Value. Assumed to be exactly round-trippable to python float. This is true for all <64bit formats known in 2023.
- exp
Raw exponent without bias
- expval
Exponent, bias subtracted
- significand
Significand as an integer
- fsignificand
Significand as a float in the range [0,2)
- signbit
Sign bit: 1 => negative, 0 => positive
- fclass
See FloatClass
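To make the fields concrete, the sketch below splits a Python float (binary64) into similarly named parts using only the standard library. It is an illustration under assumptions, not the library's decoder: `exp` is taken to be the raw stored exponent field, `significand` the trailing significand bits, and Inf and NaN inputs are not handled.

```python
import struct

def decompose_binary64(v):
    """Split a finite binary64 value into FloatValue-like fields (a sketch)."""
    (ival,) = struct.unpack("<Q", struct.pack("<d", v))   # bit pattern as int
    signbit = ival >> 63
    exp = (ival >> 52) & 0x7FF                 # raw exponent field
    significand = ival & ((1 << 52) - 1)       # trailing significand bits
    if exp == 0:
        expval = 1 - 1023                      # subnormal: fixed exponent
        fsignificand = significand / 2**52     # no implicit leading bit
    else:
        expval = exp - 1023                    # bias subtracted
        fsignificand = 1.0 + significand / 2**52
    return {
        "ival": ival, "fval": v, "exp": exp, "expval": expval,
        "significand": significand, "fsignificand": fsignificand,
        "signbit": signbit,
    }

fields = decompose_binary64(-1.5)
```

The value can be rebuilt from the fields as `(-1)**signbit * fsignificand * 2**expval`.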
- class gfloat.BlockFormatInfo
BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)
- name
Short name for the format, e.g. BlockFP8
- etype
Element data type
- k
Scaling block size
- stype
Scale datatype
- property element_bits
The number of bits in each element, d
- property scale_bits
The number of bits in the scale, w
- property block_size_bytes
The number of bytes in a block
Pretty printers
- gfloat.float_pow2str(v, min_exponent=-inf)
Render floating point values as exact fractions times a power of two.
Example: float_pow2str(127.0) is “127/64*2^6”, that is, (a significand between 1 and 2) times (a power of two).
If min_exponent is supplied, then values with exponent below min_exponent are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.
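The rendering described above can be approximated with the standard library. `pow2str_sketch` is a hypothetical re-creation for finite, non-zero inputs; it reproduces the documented example, but its formatting in other cases (integers, negative values) is an assumption and may differ from float_pow2str.

```python
import math
from fractions import Fraction

def pow2str_sketch(v, min_exponent=-math.inf):
    """Render v as '<exact fraction>*2^<exponent>' (finite, non-zero v only)."""
    if v == 0 or not math.isfinite(v):
        raise ValueError("this sketch handles finite non-zero values only")
    # frexp gives v = m * 2**e with 0.5 <= |m| < 1, so floor(log2|v|) = e - 1.
    exp = math.frexp(v)[1] - 1
    if exp < min_exponent:
        exp = int(min_exponent)            # pin the exponent, as for subnormals
    # Exact rational arithmetic: significand = v / 2**exp.
    significand = Fraction(v) / Fraction(2) ** exp
    return f"{significand}*2^{exp}"

print(pow2str_sketch(127.0))
```

With `min_exponent=0`, the value 0.25 has its exponent pinned to 0, so the significand is printed as the fraction 1/4, which is below 1 as described for subnormal-style output.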