38 OCP MXFP8 Streaming MAC Unit

How it works

The OCP MXFP8 Streaming MAC Unit is a high-performance, area-optimized arithmetic core designed for AI inference acceleration. It implements the Open Compute Project (OCP) Microscaling Formats (MX) Specification v1.0, supporting floating-point formats of 8 bits and narrower (FP8, FP6, FP4) as well as INT8, with hardware-accelerated shared scaling.

Architectural Overview

The unit is configured in its "Full" edition (2x2 tiles), featuring:

  • Dual-Lane Multiplier: Parallel processing of operands with support for Vector Packing (FP4).
  • 40-bit Aligner & 32-bit Accumulator: High-precision internal datapath to prevent overflow during long dot-product sequences.
  • Shared Scaling (UE8M0): Automatic application of an 8-bit shared exponent $X$ (a factor of $2^{X-127}$) to each element block.
  • Flexible Rounding: Support for Truncate (TRN), Ceil (CEL), Floor (FLR), and Round-to-Nearest-Even (RNE).
  • Mixed Precision: Independent format control for Operand A and Operand B within a single MAC block.
  • Logarithmic Multiplier (LNS): Optional area-optimized path using Mitchell's Approximation to reduce multiplier area by >50%.
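
To make the four rounding modes listed above concrete, the following is a minimal Python sketch of how each mode behaves when the low `k` bits of a signed integer are discarded. The function name `round_shift` and the choice of rounding position are illustrative assumptions; the document does not state at which bit position the hardware rounds. The mode codes follow the Rounding Mode field of Metadata 1.

```python
# Mode codes as listed for the Rounding Mode field (Metadata 1, bits [4:3]).
TRN, CEL, FLR, RNE = 0, 1, 2, 3


def round_shift(x: int, k: int, mode: int) -> int:
    """Discard the k low bits (k >= 1) of signed integer x using the given mode.

    Hypothetical helper for illustration only; not taken from the RTL.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    floor_q = x >> k            # Python's >> on ints floors (arithmetic shift)
    ceil_q = -((-x) >> k)       # ceiling via negated floor
    if mode == FLR:
        return floor_q
    if mode == CEL:
        return ceil_q
    if mode == TRN:
        # Towards zero: floor for non-negative, ceiling for negative values.
        return floor_q if x >= 0 else ceil_q
    if mode == RNE:
        rem = x - (floor_q << k)        # discarded bits, 0 <= rem < 2**k
        half = 1 << (k - 1)
        if rem > half or (rem == half and (floor_q & 1)):
            return floor_q + 1          # round up; ties go to the even result
        return floor_q
    raise ValueError("unknown rounding mode")
```

For example, with `k = 1` the integer 5 stands for 2.5 and -5 for -2.5, so the four modes can be compared directly on a tie.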

Streaming Protocol

To maintain a minimal IO footprint (8-bit ports), the unit uses a 41-cycle streaming protocol to process a block of 32 elements ($k=32$).

| Cycle | Input ui_in[7:0] | Input uio_in[7:0] | Output uo_out[7:0] | Description |
|---|---|---|---|---|
| 0 | Metadata 0 | Metadata 1 | 0x00 | IDLE: Load MX+ / Debug settings, or start the Short Protocol. |
| 1 | Scale A | Format A / BM A | 0x00 | Load Scale A, Format A, and BM Index A. |
| 2 | Scale B | Format B / BM B | 0x00 | Load Scale B, Format B, and BM Index B. |
| 3-34 | Element $A_i$ | Element $B_i$ | 0x00 | Stream 32 pairs of elements.* |
| 35-36 | - | - | 0x00 | Pipeline flush & final scaling. |
| 37-40 | - | - | Result [31:0] | Serialized 32-bit result (MSB first). |

*Note: In Packed Mode (uio_in[6]=1 in Cycle 0), the STREAM phase is reduced to 16 cycles (Cycles 3-18).
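
The cycle table above can be sketched as a per-cycle input schedule. The Python below is a minimal illustration for the standard (non-packed, non-short) protocol only. The function name `build_block_schedule` is hypothetical, and driving zeros during cycles 35-40 is an assumption, since the table only marks those inputs as "-".

```python
def build_block_schedule(meta0, meta1, scale_a, cfg_a, scale_b, cfg_b,
                         a_elems, b_elems):
    """Return 41 (ui_in, uio_in) byte pairs, one per cycle (cycles 0-40).

    Hypothetical helper: covers the standard 32-element block only.
    """
    if len(a_elems) != 32 or len(b_elems) != 32:
        raise ValueError("a standard block streams exactly 32 element pairs")
    schedule = [
        (meta0, meta1),      # cycle 0: Metadata 0 / Metadata 1
        (scale_a, cfg_a),    # cycle 1: Scale A / Config A
        (scale_b, cfg_b),    # cycle 2: Scale B / Config B
    ]
    schedule += list(zip(a_elems, b_elems))   # cycles 3-34: element pairs
    # Cycles 35-36 (flush) and 37-40 (result readout): inputs are unused
    # according to the table; zeros are an assumption made here.
    schedule += [(0, 0)] * 6
    return [(a & 0xFF, b & 0xFF) for a, b in schedule]
```

A driver would then place `schedule[n]` on `ui_in` and `uio_in` at cycle `n` and sample `uo_out` during cycles 37-40.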

Register Layouts

The unit captures configuration and scaling data during the first three cycles of the protocol.

Cycle 0: Metadata 0 (ui_in)

  • Short Protocol ([7]): 1: Reuse previous scales/formats; immediately jump to Cycle 3.
  • Debug En ([6]): 1: Enable internal probing and metadata echo at the end of the block.
  • Loopback En ([5]): 1: Direct input-to-output mapping for physical connectivity testing.
  • LNS Mode ([4:3]):
    • 0: Normal (Exact IEEE-like multiplication).
    • 1: LNS (Logarithmic Number System using Mitchell's Approximation).
    • 2: Hybrid (Standard for Block Max elements, LNS for all others).
  • NBM Offset A ([2:0]): (Standard Start only) Exponent offset for non-Block Max elements in Operand A (MX++).
Cycle 0: Metadata 1 (uio_in)

  • MX+ Enable ([7]): 1: Enable OCP MX+ extensions (Repurposed exponents and Block Max tracking).
  • Packed Mode ([6]): 1: Enable Vector Packing for 4-bit formats (2 elements per byte, Cycles 3-18).
  • Overflow Mode ([5]): 0: SAT (Saturate to Max/Min), 1: WRAP (Modulo arithmetic).
  • Rounding Mode ([4:3]):
    • 0: TRN (Truncate/Towards Zero).
    • 1: CEL (Ceil/Towards $+\infty$).
    • 2: FLR (Floor/Towards $-\infty$).
    • 3: RNE (Round-to-Nearest-Ties-to-Even).
  • NBM Offset B / Format A/B ([2:0]):
    • Standard Start: NBM Offset B (Exponent offset for Operand B).
    • Short Protocol: Combined Format A & B selection.
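
The two Cycle 0 bit layouts above can be packed with a few shifts. This is a minimal Python sketch; the function and parameter names are hypothetical, and only the bit positions come from the field descriptions. LNS Mode is restricted to the documented values 0-2.

```python
def pack_metadata0(short_protocol=False, debug_en=False, loopback_en=False,
                   lns_mode=0, nbm_offset_a=0):
    """Build the Cycle 0 ui_in byte (Metadata 0)."""
    if lns_mode not in (0, 1, 2):
        raise ValueError("only LNS modes 0-2 are documented")
    if not 0 <= nbm_offset_a <= 7:
        raise ValueError("NBM Offset A is a 3-bit field")
    return (int(short_protocol) << 7 | int(debug_en) << 6 |
            int(loopback_en) << 5 | lns_mode << 3 | nbm_offset_a)


def pack_metadata1(mx_plus=False, packed=False, overflow_wrap=False,
                   rounding=0, low_bits=0):
    """Build the Cycle 0 uio_in byte (Metadata 1).

    low_bits is NBM Offset B on a standard start, or the combined
    Format A/B selection when the Short Protocol is used.
    """
    if not 0 <= rounding <= 3:
        raise ValueError("Rounding Mode is a 2-bit field")
    if not 0 <= low_bits <= 7:
        raise ValueError("bits [2:0] form a 3-bit field")
    return (int(mx_plus) << 7 | int(packed) << 6 |
            int(overflow_wrap) << 5 | rounding << 3 | low_bits)
```

For instance, `pack_metadata1(packed=True, rounding=3)` selects Packed Mode with RNE rounding and saturating overflow.
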
Cycle 1: Scale A (ui_in) & Config A (uio_in)

Scale A (ui_in[7:0]):

  • Shared Scale A: 8-bit unsigned biased exponent (UE8M0, Bias 127) applied to all elements in Operand A.

Config A (uio_in[7:0]):

  • BM Index A ([7:3]): The index (0-31) of the "Block Max" element in Operand A (used in MX+ mode).
  • Format A ([2:0]):
    • 0: E4M3, 1: E5M2, 2: E3M2, 3: E2M3, 4: E2M1, 5: INT8, 6: INT8_SYM.
Cycle 2: Scale B (ui_in) & Config B (uio_in)

Scale B (ui_in[7:0]):

  • Shared Scale B: 8-bit unsigned biased exponent (UE8M0, Bias 127) applied to all elements in Operand B.

Config B (uio_in[7:0]):

  • BM Index B ([7:3]): The index (0-31) of the "Block Max" element in Operand B.
  • Format B ([2:0]): Independent format for Operand B (Enabled if SUPPORT_MIXED_PRECISION=1).
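
The Cycle 1 and Cycle 2 bytes can be sketched in the same way. The Python below is an illustration with hypothetical names (`FORMATS`, `encode_ue8m0`, `pack_config`). The format codes are the ones listed for Format A, and the scale encoding assumes a power-of-two scale with bias 127. The code 0xFF is excluded because the OCP MX specification reserves it for NaN; how this unit treats 0xFF is not stated here.

```python
import math

# Format codes as listed for Format A ([2:0]).
FORMATS = {"E4M3": 0, "E5M2": 1, "E3M2": 2, "E2M3": 3,
           "E2M1": 4, "INT8": 5, "INT8_SYM": 6}


def encode_ue8m0(scale: float) -> int:
    """Encode a positive power-of-two scale as a UE8M0 byte (bias 127)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    mant, exp = math.frexp(scale)      # scale == mant * 2**exp, mant in [0.5, 1)
    if mant != 0.5:
        raise ValueError("UE8M0 can only represent powers of two")
    code = (exp - 1) + 127
    if not 0 <= code <= 254:           # 0xFF left out (NaN in the OCP MX spec)
        raise ValueError("scale out of UE8M0 range")
    return code


def pack_config(bm_index: int, fmt: str) -> int:
    """Build a Config A/B byte: BM Index in [7:3], format code in [2:0]."""
    if not 0 <= bm_index <= 31:
        raise ValueError("BM Index is a 5-bit field (0-31)")
    return (bm_index << 3) | FORMATS[fmt]
```

A scale of 1.0 encodes to 0x7F, which matches the value used in the basic verification steps.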

How to test

Basic Verification

  1. Reset: Pulse rst_n low, then set ena high.
  2. Configuration:
    • Cycle 0: Provide 0x00 on both ui_in and uio_in (standard start: exact multiplier, TRN rounding, SAT overflow, MX+ and Packed Mode disabled).
    • Cycle 1: Provide 0x7F (1.0 scale) on ui_in and 0x00 (E4M3) on uio_in.
    • Cycle 2: Provide 0x7F (1.0 scale) on ui_in and 0x00 (E4M3) on uio_in.
  3. Data Streaming:
    • Cycles 3-34: Provide 32 pairs of values. E.g., 0x38 (1.0 in E4M3) on both ports.
  4. Result:
    • Cycles 35-36: Wait for internal processing.
    • Cycles 37-40: Read the 32-bit signed fixed-point result on uo_out.
    • For 32 pairs of $1.0 \times 1.0$, the result should be 0x00002000, which is 32.0 in signed fixed point with 8 fractional bits ($32 \times 2^{8} = 8192$).
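
The expected value in the steps above can be cross-checked with a small software model. This Python sketch is a floating-point reference under stated assumptions, not a bit-exact model of the hardware: it decodes E4M3 with bias 7, ignores NaN encodings, rounding modes, and saturation, and assumes the result has 8 fractional bits as described above. The names `decode_e4m3` and `expected_result_bytes` are hypothetical.

```python
def decode_e4m3(byte: int) -> float:
    """Decode an E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                                   # subnormal: no implicit 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def expected_result_bytes(a_bytes, b_bytes, scale_a, scale_b):
    """Return the 4 result bytes in the order they appear on uo_out (MSB first)."""
    acc = sum(decode_e4m3(a) * decode_e4m3(b) for a, b in zip(a_bytes, b_bytes))
    acc *= 2.0 ** (scale_a - 127) * 2.0 ** (scale_b - 127)   # shared scales
    fixed = int(round(acc * 256)) & 0xFFFFFFFF     # 8 fractional bits, 32-bit wrap
    return [(fixed >> shift) & 0xFF for shift in (24, 16, 8, 0)]
```

For the test vector above, the four bytes read on cycles 37-40 would be 0x00, 0x00, 0x20, 0x00.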

Advanced Modes

  • Short Protocol: Set ui_in[7]=1 in Cycle 0 to bypass scale loading. Useful for weight-stationary kernels where scales and formats remain constant across blocks.
  • Vector Packing: Set uio_in[6]=1 in Cycle 0. Stream two 4-bit elements per byte (High nibble = Element $i+1$, Low nibble = Element $i$).
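
The nibble order described for Vector Packing can be sketched as follows. The helper name `pack_nibbles` is hypothetical; the layout (low nibble = element $i$, high nibble = element $i+1$) is taken from the description above.

```python
def pack_nibbles(elems):
    """Pack 4-bit element codes two per byte for Packed Mode streaming."""
    if len(elems) % 2:
        raise ValueError("need an even number of 4-bit elements")
    return [((elems[i + 1] & 0xF) << 4) | (elems[i] & 0xF)
            for i in range(0, len(elems), 2)]
```

A block of 32 four-bit elements therefore becomes 16 bytes, one per cycle of the shortened STREAM phase.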

External hardware

  • Tiny Tapeout DevKit: The easiest way to interface with the chip. Use the provided MicroPython driver (test/TT_MAC_RUN.PY) for quick prototyping.
  • Sipeed Tang Nano 4K: For high-speed testing, a dedicated FPGA bitstream and Cortex-M3 testbench are provided in the repository.

IO

| Port | Name | Description |
|---|---|---|
| ui_in[7:0] | Operand A / Scale A | Elements $A_i$ or Scale $X_A$. |
| uio_in[7:0] | Operand B / Scale B | Elements $B_i$ or Scale $X_B$. |
| uo_out[7:0] | Result Out | Serialized 32-bit dot product result. |
| clk | Clock | System clock (Target: 20 MHz). |
| rst_n | Reset | Active-low asynchronous reset. |
| ena | Enable | Clock enable. |

Appendix: OCP MX+ Mathematics

The OCP MX+ extension optimizes quantization by preserving high-precision "outliers" (Block Max elements) while maintaining a low bit-width for the rest of the block.

1. Base OCP MX Mathematics (Standard)

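For a block of $k$ elements, the value of an element $A_i$ is given by:

$$V(A_i) = S \cdot M_i \cdot 2^{X_A - 127}$$

Where:

  • $S$: Sign ($\pm 1$, taken from the sign bit).
  • $M_i$: Mantissa (significand), including the implicit leading bit (1 for normal values, 0 for subnormals).
  • $X_A$: Shared 8-bit scale (UE8M0).
  • $E_i$: Individual element exponent (for FP8/FP6/FP4 formats). It does not appear explicitly in the formula; its per-element factor $2^{E_i - \text{bias}}$ is understood to be part of the decoded element magnitude.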
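
As a worked illustration of the terms above, the following Python sketch decodes an FP4 (E2M1) code and applies the shared scale. It assumes the OCP MX E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1); the function names are hypothetical and the code is a reference model, not the hardware datapath.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code to its unscaled element value."""
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        # Subnormal: implicit leading bit is 0, exponent fixed at 1 - bias = 0.
        return sign * (man / 2.0)
    # Normal: implicit leading bit is 1.
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)


def mx_value(code: int, shared_scale: int) -> float:
    """Element value after applying the UE8M0 shared scale (bias 127)."""
    return decode_e2m1(code) * 2.0 ** (shared_scale - 127)
```

The eight non-negative codes decode to 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so with a shared scale of 128 the block covers magnitudes up to 12.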

2. OCP MX+ (Extended Mantissa)

When MX+ Enable is set, the Block Max (BM) element, identified by BM Index, repurposes its exponent bits as additional mantissa.

Normal Element ($i \neq BM$): Decoded as standard MXFP (e.g., E4M3).

Block Max Element ($i = BM$):

  • Exponent: Fixed to $E_{max}$ for the selected format.
  • Mantissa: The original exponent bits are appended to the mantissa field.
  • Benefit: For FP4 (E2M1), the mantissa grows from 1 bit to 3 bits ($1 + 2$), reducing quantization error for the most critical value by up to 10x.

3. OCP MX++ (Decoupled Shared Scaling)

MX++ allows "Non-Block Max" (NBM) elements to use a finer quantization grid than the BM element by applying a secondary exponent offset.

$$V(A_{i \neq BM}) = S \cdot M_i \cdot 2^{(X_A - 127) - \text{NBM\_Offset}_A}$$

This effectively "zooms in" on the smaller values in the block, lowering the noise floor caused by a single large outlier.
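
A small numeric sketch of the offset follows. It assumes, as the text implies, that the Block Max element keeps the plain shared scale while every other element has the NBM offset subtracted from the exponent. The function name and the `element_value` argument (the already decoded $S \cdot M_i$) are hypothetical.

```python
def mxpp_value(element_value: float, shared_scale: int,
               nbm_offset: int, is_block_max: bool) -> float:
    """Apply the shared scale, minus the NBM offset for non-Block-Max elements."""
    exponent = shared_scale - 127
    if not is_block_max:
        exponent -= nbm_offset      # finer grid for the non-BM elements
    return element_value * 2.0 ** exponent
```

With a shared scale of 127 and an offset of 2, a decoded value of 1.5 represents 0.375 for a non-BM element, so the same codes span a grid four times finer.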

4. LNS Mitchell's Approximation

In LNS Mode, multiplication $P = A \times B$ is performed in the logarithmic domain: $\log_2(P) = \log_2(A) + \log_2(B)$

To avoid expensive exponentiation and logarithm circuits, the unit uses Mitchell's Approximation: $\log_2(1+m) \approx m, \quad m \in [0, 1)$

The product of two significands $(1+m_a)$ and $(1+m_b)$ is approximated as:

$$(1+m_a)(1+m_b) \approx \begin{cases} 1 + m_a + m_b & \text{if } m_a + m_b < 1 \\ 2(m_a + m_b) & \text{if } m_a + m_b \ge 1 \end{cases}$$

This allows the multiplier to be replaced by a simple adder and a shift, reducing hardware area by over 50%.
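
The piecewise approximation above is easy to check numerically. The Python below is a minimal sketch on real-valued mantissa fractions (the hardware works on fixed-width fields, which this does not model); the function names are hypothetical.

```python
def mitchell_significand_product(m_a: float, m_b: float) -> float:
    """Approximate (1 + m_a) * (1 + m_b) for m_a, m_b in [0, 1)."""
    s = m_a + m_b
    # s < 1: result stays in [1, 2); otherwise carry into the exponent.
    return 1.0 + s if s < 1.0 else 2.0 * s


def relative_error(m_a: float, m_b: float) -> float:
    """Relative error of the approximation against the exact product."""
    exact = (1.0 + m_a) * (1.0 + m_b)
    return (exact - mitchell_significand_product(m_a, m_b)) / exact
```

The approximation never exceeds the exact product, and its relative error peaks at about 11.1% when both fractions are 0.5 (2.0 instead of 2.25), which is the accuracy traded for the smaller multiplier.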

Thank you!

A massive thank you to Matt Venn, Uri Shaked, Sophie, and the entire Tiny Tapeout / IHP community for making open-source silicon a reality. This project was built on the foundation of your incredible tools and dedication.

IO

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | data_in_a[0] | data_out[0] | data_in_b[0] |
| 1 | data_in_a[1] | data_out[1] | data_in_b[1] |
| 2 | data_in_a[2] | data_out[2] | data_in_b[2] |
| 3 | data_in_a[3] | data_out[3] | data_in_b[3] |
| 4 | data_in_a[4] | data_out[4] | data_in_b[4] |
| 5 | data_in_a[5] | data_out[5] | data_in_b[5] |
| 6 | data_in_a[6] | data_out[6] | data_in_b[6] |
| 7 | data_in_a[7] | data_out[7] | data_in_b[7] |
