450 Simple MAC Engine w/ Postproc

  • Author: Luca Goddijn
  • Description: Quantized INT8 MAC engine with bias, activation, and quantizer postproc, SPI-controlled
  • GitHub repository
  • Clock: 25000000 Hz

How it works

tt_um_arty3_mac_engine is a standalone signed-INT8 multiply–accumulate (MAC) engine with an on-chip post-processing pipeline, controlled over a 4-wire SPI slave. It is intended as a co-processor for quantized neural-network inference on a small MCU: the host streams operands and bias values over SPI, issues compute commands, and reads back an INT8 result.

Block diagram

flowchart LR
    SPI["SPI slave<br/>(uio[3:0])"]
    RF["regfile"]
    FSM["cmd_fsm"]
    PE["conv_pe<br/>(MAC)"]
    PP["postproc<br/>bias &rarr; act &rarr; quant"]

    SPI <-->|"rf_addr / rf_wdata<br/>rf_rdata / we / re"| RF
    RF  <-->|"cmd_valid, cmd_code,<br/>op_a, op_b, status"| FSM
    FSM -->|"act_in, wt_in,<br/>acc_in, valid_in"| PE
    PE  -->|"acc_out"| FSM
    PE  -->|"acc_out (live + shadow)"| RF
    FSM -->|"in_data, in_valid,<br/>out_ready"| PP
    PP  -->|"out_data, out_valid<br/>(&rarr; RESULT)"| RF
    RF  -->|"bias, quant_shift,<br/>act_mode"| PP
  • conv_pe - a single signed INT8 x INT8 → INT32 MAC with a registered accumulator. One MAC per CMD_MAC, no internal storage besides the accumulator.
  • postproc - three-stage pipeline: bias_add → activation (NONE / ReLU / Leaky ReLU, slope = 1/8) → quantize (arithmetic right shift + signed saturation to INT8). Latency: 3 cycles.
  • cmd_fsm - 7-state sequencer (IDLE, MAC, CLR, PP_FEED, PP_WAIT, SOFTRST, DOT4) that owns the handshake to conv_pe and postproc.
  • regfile - 7-bit-addressed byte-wide register file. Holds operands, bias, quantizer/activation configuration, status flags, and the read ports for the accumulator and the result.
  • spi_slave - mode 0 (CPOL=0, CPHA=0), MSB-first. Every transaction is exactly 16 SCLK cycles framed by CS_N.
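
As a rough software reference for the postproc stages listed above, the following Python sketch models bias add, activation, and quantization. The function name `postproc_model` is hypothetical. Two details are assumptions that this page does not confirm: that the Leaky ReLU slope of 1/8 is realized as an arithmetic right shift by 3, and that both shifts truncate toward negative infinity. The model also does not wrap the intermediate sum to 32 bits. docs/SPEC.md remains the authority.

```python
INT8_MIN, INT8_MAX = -128, 127

# ACT_MODE encodings taken from the register map
ACT_NONE, ACT_RELU, ACT_LEAKY = 0b00, 0b01, 0b10


def postproc_model(acc: int, bias: int, quant_shift: int, act_mode: int) -> int:
    """Software model of bias -> activation -> quantize (rounding is assumed)."""
    if not 0 <= quant_shift <= 31:
        raise ValueError("QUANT_SHIFT must be in 0..31")
    x = acc + bias                          # stage 1: bias add
    if act_mode == ACT_RELU:                # stage 2: activation
        x = max(x, 0)
    elif act_mode == ACT_LEAKY and x < 0:
        x >>= 3                             # assumption: slope 1/8 as a shift
    x >>= quant_shift                       # stage 3: arithmetic right shift
    return max(INT8_MIN, min(INT8_MAX, x))  # signed saturation to INT8
```

A model like this may be handy for generating expected values before comparing them against RESULT on real hardware.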

Register map (summary)

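All registers are 8 bits. The top bit of the SPI header byte selects read or write; the low 7 bits are the address. In the walkthrough below, writes set the top bit (0x81 writes CMD at 0x01) and reads clear it (0x0C reads RESULT).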

Addr        Name         R/W  Reset  Description
0x00        STATUS       RO   0x01   {0000, ACC_OVF_STK, RESULT_VALID, BUSY, IDLE}
0x01        CMD          WO   -      Writing launches a command (see below)
0x02        OP_A         RW   0x00   INT8 multiplicand A
0x03        OP_B         RW   0x00   INT8 multiplicand B
0x04        BIAS         RW   0x00   INT8 bias added in postproc
0x05        QUANT_SHIFT  RW   0x00   Arithmetic right shift (0–31) applied in postproc
0x06        ACT_MODE     RW   0x00   00=None, 01=ReLU, 10=LeakyReLU(α=1/8)
0x08–0x0B   ACC_B0..B3   RO   0x00   Accumulator, little-endian. Reading ACC_B0 snapshots the upper bytes into a coherent shadow.
0x0C        RESULT       RO   0x00   INT8 output of the last CMD_POSTPROC. Reading clears RESULT_VALID.
0x10        FEATURE_ID   RO   0xA1   Engine family + revision
0x12        OP_A1        RW   0x00   Lane-1 multiplicand A (CMD_DOT4)
0x13        OP_B1        RW   0x00   Lane-1 multiplicand B (CMD_DOT4)
0x14        OP_A2        RW   0x00   Lane-2 multiplicand A (CMD_DOT4)
0x15        OP_B2        RW   0x00   Lane-2 multiplicand B (CMD_DOT4)
0x16        OP_A3        RW   0x00   Lane-3 multiplicand A (CMD_DOT4)
0x17        OP_B3        RW   0x00   Lane-3 multiplicand B (CMD_DOT4)
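
The two-byte SPI frame (header, then data) can be sketched in Python as below. The helper names `write_frame` and `read_frame` are hypothetical, and the polarity of the top header bit is an assumption inferred from the walkthrough frames on this page, where 0x81 0x02 writes CMD and 0x0C 0x00 reads RESULT.

```python
WRITE_FLAG = 0x80  # assumption: bit 7 set = write, inferred from the walkthrough


def write_frame(addr: int, value: int) -> bytes:
    """Build the two MOSI bytes that write `value` to register `addr`."""
    if not 0 <= addr <= 0x7F:
        raise ValueError("address must fit in 7 bits")
    # Masking with 0xFF turns a signed INT8 such as -2 into its byte form 0xFE.
    return bytes([WRITE_FLAG | addr, value & 0xFF])


def read_frame(addr: int) -> bytes:
    """Build the two MOSI bytes that read register `addr` (second byte is a dummy)."""
    if not 0 <= addr <= 0x7F:
        raise ValueError("address must fit in 7 bits")
    return bytes([addr, 0x00])
```

For example, `write_frame(0x02, -2)` yields the bytes 0x82 0xFE, matching the "OP_A = -2" row of the walkthrough.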

Commands (write to CMD):

Code  Mnemonic      Action
0x00  CMD_NOP       No-op
0x01  CMD_MAC       acc ← acc + OP_A * OP_B
0x02  CMD_CLR_ACC   acc ← 0 (config registers untouched)
0x03  CMD_POSTPROC  Drive acc through bias → activation → quantize; result lands in RESULT
0x04  CMD_DOT4      4-cycle burst MAC: acc ← acc + Σ OP_A{i} * OP_B{i} for i = 0..3 (lane 0 = OP_A/OP_B, lanes 1..3 = OP_A1..3/OP_B1..3)
0xFF  CMD_RESET     Soft reset: clears acc, RESULT, sticky flags; preserves OP_A/OP_B/BIAS/QUANT_SHIFT/ACT_MODE
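
To make the accumulator commands concrete, here is a small Python model of CMD_MAC, CMD_DOT4, and CMD_CLR_ACC. The class `AccModel` is hypothetical. How the hardware treats the accumulator value on overflow is not stated on this page; the two's-complement wrap below is an assumption, and only the sticky flag behaviour comes from the status flag description.

```python
class AccModel:
    """Toy model of the INT32 accumulator commands (wrap on overflow is assumed)."""

    def __init__(self) -> None:
        self.acc = 0
        self.ovf_sticky = False  # mirrors STATUS.ACC_OVF_STK

    def _add(self, product: int) -> None:
        total = self.acc + product
        if not -2**31 <= total <= 2**31 - 1:
            self.ovf_sticky = True
            total = (total + 2**31) % 2**32 - 2**31  # assumed two's-complement wrap
        self.acc = total

    def mac(self, a: int, b: int) -> None:
        """CMD_MAC: acc <- acc + a * b."""
        self._add(a * b)

    def dot4(self, a_lanes, b_lanes) -> None:
        """CMD_DOT4: four MACs, lane 0 first."""
        if len(a_lanes) != 4 or len(b_lanes) != 4:
            raise ValueError("CMD_DOT4 uses exactly four lanes")
        for a, b in zip(a_lanes, b_lanes):
            self._add(a * b)

    def clr(self) -> None:
        """CMD_CLR_ACC: acc <- 0; the sticky overflow flag is left alone."""
        self.acc = 0
```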

Pin map

ui_in[7:0]   : reserved (tied off internally)
uo_out[0]    : STATUS.IDLE
uo_out[1]    : STATUS.BUSY
uo_out[2]    : STATUS.RESULT_VALID
uo_out[3]    : STATUS.ACC_OVF_STK
uo_out[6:4]  : cmd_fsm state[2:0]
uo_out[7]    : heartbeat (clk / 2^20, ~23.8 Hz at 25 MHz)
uio[0]       : SPI_CS_N (in,  active low)
uio[1]       : SPI_SCLK (in,  mode 0)
uio[2]       : SPI_MOSI (in)
uio[3]       : SPI_MISO (out, driven; held low while CS_N high)
uio[7:4]     : reserved (in)

uio_oe = 8'b0000_1000. MISO is not tri-stated - if you share this chip's SPI bus with other slaves you must add an external buffer.
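
When uo_out is captured with a logic analyzer, each 8-bit sample can be split into the fields of the pin map. The Python sketch below shows one way to do that; `decode_uo_out` is a hypothetical helper, and the FSM state is returned as a raw 3-bit number because the state encoding is not given on this page.

```python
def decode_uo_out(sample: int) -> dict:
    """Split one 8-bit uo_out sample into the fields listed in the pin map."""
    return {
        "idle": bool(sample & 0x01),          # uo_out[0]
        "busy": bool(sample & 0x02),          # uo_out[1]
        "result_valid": bool(sample & 0x04),  # uo_out[2]
        "acc_ovf_stk": bool(sample & 0x08),   # uo_out[3]
        "fsm_state": (sample >> 4) & 0x07,    # uo_out[6:4], raw value
        "heartbeat": bool(sample & 0x80),     # uo_out[7]
    }
```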

Electrical / timing summary

  • System clock: nominal 25 MHz (40 ns), max ~50 MHz.
  • SPI clock: must be ≤ sysclk/4 (≤ 6.25 MHz at 25 MHz sysclk).
  • SPI mode: 0 (CPOL=0, CPHA=0), MSB-first, 8-bit framed, 2 bytes per transaction.
  • Reset: active-low asynchronous rst_n, synchronized internally with a 2-flop async-assert / sync-deassert synchronizer.
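
The clock relationships above reduce to simple arithmetic, sketched here in Python with hypothetical helper names. The transaction time counts only the 16 SCLK cycles and ignores any CS_N setup or hold gaps, which this page does not specify.

```python
def max_sclk_hz(sysclk_hz: float) -> float:
    """Upper bound on the SPI clock: sysclk / 4."""
    return sysclk_hz / 4


def transaction_time_s(sclk_hz: float) -> float:
    """Duration of the 16 SCLK cycles of one transaction."""
    return 16 / sclk_hz


def heartbeat_hz(sysclk_hz: float) -> float:
    """Frequency of the uo_out[7] heartbeat: clk / 2^20."""
    return sysclk_hz / 2**20


sysclk = 25_000_000
print(max_sclk_hz(sysclk))                      # 6250000.0
print(transaction_time_s(max_sclk_hz(sysclk)))  # about 2.56e-06 s
print(round(heartbeat_hz(sysclk), 1))           # 23.8
```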

For full specifications - SPI bit-level timing, the regfile's read-side effects, the same-cycle MAC/ACC race definition, the test matrix, and post-silicon bring-up procedure - see docs/SPEC.md.

How to test

Hardware

You need a SPI master capable of mode-0, MSB-first transactions at ≤ sysclk/4. An MCU dev board (e.g. Raspberry Pi Pico, STM32 Nucleo, Arduino with a hardware SPI peripheral) is sufficient. A logic analyzer on uo_out[6:0] is useful for watching the FSM state and status flags without polling over SPI.

Wire-up:

Chip pin  Host pin
uio[0]    SPI CS (slave-select), driven by host
uio[1]    SPI SCLK
uio[2]    SPI MOSI
uio[3]    SPI MISO (input on the host)
clk       host-provided system clock (≤ 50 MHz; 25 MHz nominal)
rst_n     host-controlled reset, active low

Smoke test

  1. Hold rst_n low for at least 3 system clock cycles, then release it.
  2. Read STATUS (addr 0x00). Expected value: 0x01 (IDLE = 1).
  3. Read FEATURE_ID (addr 0x10). Expected value: 0xA1.

If those two reads pass, the SPI slave, regfile, and reset synchronizer are all working.
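
Steps 2 and 3 of the smoke test might be scripted as follows. This is a host-agnostic Python sketch: `transfer` stands for whatever full-duplex two-byte SPI call your platform provides, and `fake_transfer` is a stand-in so the example runs without hardware. It assumes the register value is returned in the second MISO byte, which this page does not state explicitly.

```python
from typing import Callable

STATUS, FEATURE_ID = 0x00, 0x10


def read_reg(transfer: Callable[[bytes], bytes], addr: int) -> int:
    """One 16-SCLK read; assumes the value arrives in the second MISO byte."""
    miso = transfer(bytes([addr & 0x7F, 0x00]))
    return miso[1]


def smoke_test(transfer: Callable[[bytes], bytes]) -> bool:
    """Check STATUS == 0x01 and FEATURE_ID == 0xA1; reset is left to the caller."""
    return (read_reg(transfer, STATUS) == 0x01
            and read_reg(transfer, FEATURE_ID) == 0xA1)


def fake_transfer(mosi: bytes) -> bytes:
    """Stand-in for real hardware: answers reads with the documented reset values."""
    reset_values = {STATUS: 0x01, FEATURE_ID: 0xA1}
    return bytes([0x00, reset_values.get(mosi[0] & 0x7F, 0x00)])


print(smoke_test(fake_transfer))  # True
```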

Functional walkthrough: dot product [3, -2] · [4, 5] with ReLU

Each row below is one full 16-SCLK SPI transaction.

MOSI:  0x81 0x02     ; CMD_CLR_ACC                (acc = 0)
MOSI:  0x82 0x03     ; OP_A = 3
MOSI:  0x83 0x04     ; OP_B = 4
MOSI:  0x81 0x01     ; CMD_MAC                    (acc = 12)
MOSI:  0x82 0xFE     ; OP_A = -2
MOSI:  0x83 0x05     ; OP_B = 5
MOSI:  0x81 0x01     ; CMD_MAC                    (acc = 12 + (-10) = 2)
MOSI:  0x84 0x00     ; BIAS        = 0
MOSI:  0x85 0x00     ; QUANT_SHIFT = 0
MOSI:  0x86 0x01     ; ACT_MODE    = ReLU
MOSI:  0x81 0x03     ; CMD_POSTPROC
MOSI:  0x00 0x00     ; poll STATUS until RESULT_VALID (bit 2) = 1
MOSI:  0x0C 0x00     ; read RESULT  -> MISO returns 0x02

Between commands you can poll STATUS (bit 0 = IDLE) to check that the FSM has returned to idle. Most commands complete in a handful of system clocks; the SPI transaction itself is the dominant latency.
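
The walkthrough generalizes to vectors of any length. The Python sketch below builds the list of write frames for a dot product followed by CMD_POSTPROC; `dot_product_frames` and `wr` are hypothetical helpers, and the write bit (0x80) follows the frames shown above. Polling STATUS and reading RESULT are left to the caller.

```python
REG = {"CMD": 0x01, "OP_A": 0x02, "OP_B": 0x03,
       "BIAS": 0x04, "QUANT_SHIFT": 0x05, "ACT_MODE": 0x06}
CMD_MAC, CMD_CLR_ACC, CMD_POSTPROC = 0x01, 0x02, 0x03


def wr(name: str, value: int) -> bytes:
    """Two-byte write frame; bit 7 of the header marks a write."""
    return bytes([0x80 | REG[name], value & 0xFF])


def dot_product_frames(a, b, bias=0, quant_shift=0, act_mode=0):
    """Frames for: clear acc, one CMD_MAC per element pair, then postproc."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    frames = [wr("CMD", CMD_CLR_ACC)]
    for x, y in zip(a, b):
        frames += [wr("OP_A", x), wr("OP_B", y), wr("CMD", CMD_MAC)]
    frames += [wr("BIAS", bias), wr("QUANT_SHIFT", quant_shift),
               wr("ACT_MODE", act_mode), wr("CMD", CMD_POSTPROC)]
    return frames


for frame in dot_product_frames([3, -2], [4, 5], act_mode=0x01):
    print(f"MOSI:  0x{frame[0]:02X} 0x{frame[1]:02X}")
```

For the inputs shown, the printed frames match the first eleven rows of the walkthrough.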

Reading the full INT32 accumulator

Always read the four bytes in order ACC_B0, ACC_B1, ACC_B2, ACC_B3. Reading ACC_B0 snapshots the upper three bytes into a coherent shadow; out-of-order reads return stale shadow bytes.
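
Once the four bytes have been read in that order, they can be combined into a signed 32-bit value. A minimal Python sketch, with the hypothetical helper name `assemble_acc`:

```python
ACC_READ_ORDER = (0x08, 0x09, 0x0A, 0x0B)  # ACC_B0 first, so the shadow is coherent


def assemble_acc(b0: int, b1: int, b2: int, b3: int) -> int:
    """Combine ACC_B0..ACC_B3 (little-endian) into a signed INT32."""
    return int.from_bytes(bytes([b0, b1, b2, b3]), "little", signed=True)


print(assemble_acc(0x02, 0x00, 0x00, 0x00))  # 2
print(assemble_acc(0xFE, 0xFF, 0xFF, 0xFF))  # -2
```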

Status flags

  • IDLE (bit 0): FSM is in IDLE and postproc is empty.
  • BUSY (bit 1): a command is executing.
  • RESULT_VALID (bit 2): CMD_POSTPROC has finished. Cleared by reading RESULT.
  • ACC_OVF_STK (bit 3): a MAC overflowed the INT32 accumulator. Sticky until CMD_RESET or hard reset.

Simulation

The repository ships a cocotb-based test suite (Icarus Verilog) under test/. It bit-bangs the SPI interface, exercises every command, and includes saturation/overflow coverage. From test/:

make

External hardware

None required. The chip is a self-contained SPI slave; any host MCU with hardware SPI mode 0 is sufficient. A logic analyzer (Saleae, sigrok-compatible, etc.) on uo_out[6:0] is useful for bring-up but not mandatory. If sharing the SPI bus with other slaves, add an external buffer on uio[3] (MISO) because the pad is permanently driven.

Additional links

  • Project repository: ttgf-mac-engine
  • Spec documentation: SPEC.md

IO

#  Input   Output                Bidirectional
0  unused  STATUS.IDLE           SPI_CS_N (in)
1  unused  STATUS.BUSY           SPI_SCLK (in)
2  unused  STATUS.RESULT_VALID   SPI_MOSI (in)
3  unused  STATUS.ACC_OVF_STK    SPI_MISO (out)
4  unused  FSM state[0]          reserved (in)
5  unused  FSM state[1]          reserved (in)
6  unused  FSM state[2]          reserved (in)
7  unused  heartbeat (clk/2^20)  reserved (in)
