API

Scalar Functions

gfloat.decode_float(fi, i)[source]

Given FormatInfo and integer code point, decode to a FloatValue

Parameters:
  • fi (FormatInfo) – Floating point format descriptor.

  • i (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) = fi.k

Returns:

Decoded float value

Raises:

ValueError – If i is outside the range of valid code points in fi.

gfloat.round_float(fi, v, rnd=RoundMode.TiesToEven, sat=False)[source]

Round input to the given FormatInfo, given rounding mode and saturation flag

An input NaN will convert to a NaN in the target. An input Infinity will convert to the largest float if sat, otherwise to an Inf, if present, otherwise to a NaN. Negative zero will be returned if the format has negative zero, otherwise zero.

Parameters:
  • fi (FormatInfo) – Describes the target format

  • v (float) – Input value to be rounded

  • rnd (RoundMode) – Rounding mode to use

  • sat (bool) – Saturation flag: if True, round overflowed values to fi.max

Returns:

A float which is one of the values in the format.

Raises:

ValueError – The target format cannot represent the input (e.g. converting a NaN, or an Inf when the target has no NaN or Inf, and sat is false)

gfloat.encode_float(fi, v)[source]

Encode input to the given FormatInfo.

Will round toward zero if v is not in the value set. Will saturate to Inf, NaN, fi.max in order of precedence. Encode -0 to 0 if not fi.has_nz

For other roundings and saturations, call round_float() first.

Parameters:
  • fi (FormatInfo) – Describes the target format

  • v (float) – The value to be encoded.

Returns:

The integer code point

Block format functions

gfloat.decode_block(fi, block)[source]

Decode a block of integer codepoints in Block Format fi

The scale is encoded in the first value of block, with the remaining values encoding the block elements.

The size of the iterable is not checked against the format descriptor.

Parameters:
  • fi (BlockFormatInfo) – Describes the block format

  • block (Iterable[int]) – Input block

Returns:

A sequence of floats representing the encoded values.

gfloat.encode_block(fi, scale, vals, round=RoundMode.TiesToEven)[source]

Encode float vals into block Format described by fi

The scale is explicitly passed, and the vals are assumed to already be multiplied by 1/scale. That is, this is pure encoding, scaling is computed and applied elsewhere (see e.g. quantize_block()).

It is checked for overflow in the target format, and will raise an exception if it does.

Parameters:
  • fi (BlockFormatInfo) – Describes the target block format

  • scale (float) – Scale to be recorded in the block

  • vals (Iterable[float]) – Input block

  • round (RoundMode) – Rounding mode to use, defaults to TiesToEven

Returns:

A sequence of ints representing the encoded values.

Raises:

ValueError – The scale overflows the target scale encoding format.

gfloat.quantize_block(fi, vals, compute_scale, round=RoundMode.TiesToEven)[source]

Encode and decode a block of vals of bytes into block format described by fi

Parameters:
  • fi (BlockFormatInfo) – Describes the target block format

  • vals (numpy.array) – Input block

  • compute_scale ((float, ArrayLike) -> float) – Callable to compute the scale, defaults to compute_scale_amax()

  • round (RoundMode) – Rounding mode to use, defaults to TiesToEven

Returns:

An array of floats representing the quantized values.

Raises:

ValueError – The scale overflows the target scale encoding format.

gfloat.compute_scale_amax(emax, vals)[source]

Compute a scale factor such that vals can be scaled to the range [0, 2**emax]. That is, scale is computed such that the largest exponent in the array vals * scale will be emax.

The scale is clipped to the range 2**[-127, 127].

If all values are zero, any scale value smaller than emax would be accurate, but returning the smallest possible means that quick checks on the magnitude to identify near-zero blocks will also find the all-zero blocks.

Parameters:
  • emax (float) – Maximum exponent to appear in vals * scale

  • vals (ArrayLike) – Input block

Returns:

A float such that vals * scale has exponents less than or equal to emax.

Note

If all vals are zero, 1.0 is returned.

Classes

class gfloat.FormatInfo[source]

Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.

name

Short name for the format, e.g. binary32, bfloat16

k

Number of bits in the format

precision

Number of significand bits (including implicit leading bit)

emax

Largest exponent, emax, which shall equal floor(log_2(maxFinite))

has_nz

Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0

has_infs

Set if format includes +/- Infinity. If set, the non-nan value with the highest encoding for each sign (s) is replaced by (s)Inf.

num_high_nans

Number of NaNs that are encoded in the highest encodings for each sign

has_subnormals

Set if format encodes subnormals

is_signed

Set if the format has a sign bit

is_twos_complement

Set if the format uses two’s complement encoding for the significand

property tSignificandBits

The number of trailing significand bits, t

property expBits

The number of exponent bits, w

property signBits

The number of sign bits, s

property expBias

The exponent bias derived from (p,emax)

This is the bias that should be applied so that

\(floor(log_2(maxFinite)) = emax\)

property bits

The number of bits occupied by the type.

property eps

The difference between 1.0 and the next smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.

property epsneg

The difference between 1.0 and the next smallest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.

property iexp

The number of bits in the exponent portion of the floating point representation.

property machep

The exponent that yields eps.

property max

The largest representable number.

property maxexp

The smallest positive power of the base (2) that causes overflow.

property min

The smallest representable number, typically -max.

property num_nans

The number of code points which decode to NaN

property code_of_nan

Return a codepoint for a NaN

property code_of_posinf

Return a codepoint for positive infinity

property code_of_neginf

Return a codepoint for negative infinity

property code_of_zero

Return a codepoint for (non-negative) zero

property has_zero

Does the format have zero?

This is false if the mantissa is 0 width and we don’t have subnormals - essentially the mantissa is always decoded as 1. If we have subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.

property code_of_negzero

Return a codepoint for negative zero

property code_of_max

Return a codepoint for fi.max

property code_of_min

Return a codepoint for fi.min

property smallest_normal

The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.

property smallest_subnormal

The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.

property smallest

The smallest positive floating point number.

property is_all_subnormal

Are all encoded values subnormal?

class gfloat.FloatClass[source]

Enum for the classification of a FloatValue.

NORMAL = 1

A positive or negative normalized non-zero value

SUBNORMAL = 2

A positive or negative subnormal value

ZERO = 3

A positive or negative zero value

INFINITE = 4

A positive or negative infinity (+/-Inf)

NAN = 5

Not a Number (NaN)

class gfloat.RoundMode[source]

Enum for IEEE-754 rounding modes.

Result r is obtained from input v depending on rounding mode as follows

TowardZero = 1

\(\max \{ r ~ s.t. ~ |r| \le |v| \}\)

TowardNegative = 2

\(\max \{ r ~ s.t. ~ r \le v \}\)

TowardPositive = 3

\(\min \{ r ~ s.t. ~ r \ge v \}\)

TiesToEven = 4

Round to nearest, ties to even

TiesToAway = 5

Round to nearest, ties away from zero

class gfloat.FloatValue[source]

A floating-point value decoded in great detail.

ival

Integer code point

fval

Value. Assumed to be exactly round-trippable to python float. This is true for all <64bit formats known in 2023.

exp

Raw exponent without bias

expval

Exponent, bias subtracted

significand

Significand as an integer

fsignificand

Significand as a float in the range [0,2)

signbit

1 => negative, 0 => positive

Type:

Sign bit

fclass

See FloatClass

class gfloat.BlockFormatInfo[source]

BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)

name

Short name for the format, e.g. BlockFP8

etype

Element data type

k

Scaling block size

stype

Scale datatype

property element_bits

The number of bits in each element, d

property scale_bits

The number of bits in the scale, w

property block_size_bytes

The number of bytes in a block

Pretty printers

gfloat.float_pow2str(v, min_exponent=-inf)[source]

Render floating point values as exact fractions times a power of two.

Example: float_pow2str(127.0) is “127/64*2^6”,

That is (a significand between 1 and 2) times (a power of two).

If min_exponent is supplied, then values with exponent below min_exponent, are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.

gfloat.float_tilde_unless_roundtrip_str(v, width=14, d=8)[source]

Return a string representation of v, in base 10, with maximum width width and decimal digits d