581 Aswarby INT8 MAC

How it works

This is a weight-stationary signed-INT8 multiply-accumulate (MAC) engine — the single compute primitive at the heart of every quantized convolution layer. A weight is loaded once and held "stationary" while a stream of activation bytes is multiplied into a 32-bit accumulator:

acc = clamp_int32( acc + weight * activation )

Both operands are signed 8-bit (two's complement, range −128..127). The product is signed 16-bit; it is added into a signed 32-bit accumulator whose add saturates at the INT32 limits (+2147483647 / −2147483648) so overflow is well-defined, exactly as in real fixed-point inference hardware. A sticky ovf flag records whether saturation has ever occurred since the last clear.
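The accumulate rule above can be sketched as a small Python reference model. This is an illustrative sketch, not the project's golden model: the names `mac_step`, `INT32_MAX` and `INT32_MIN` are assumptions introduced here.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def mac_step(acc, weight, activation):
    """One signed-INT8 MAC step; returns (new_acc, saturated)."""
    assert -128 <= weight <= 127 and -128 <= activation <= 127
    total = acc + weight * activation          # exact: Python ints do not overflow
    clamped = max(INT32_MIN, min(INT32_MAX, total))
    return clamped, clamped != total           # saturated if the clamp changed the sum

# Two MACs with weight 3 and activation 5, tracking a sticky overflow flag.
acc, ovf = 0, False
for activation in (5, 5):
    acc, saturated = mac_step(acc, 3, activation)
    ovf = ovf or saturated                     # sticky: stays set until a clear
print(acc, ovf)  # 30 False
```

The flag is kept outside `mac_step` so that the "sticky until cleared" behaviour is explicit in the caller.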

Internally the design is two small modules:

  • mac_core — the datapath: weight register, signed 8×8 multiplier, saturating 32-bit accumulator, and a combinational byte-select mux for readout. The MAC is 4-stage pipelined — split-multiply, reconstruct product, 33-bit add, then saturate/clamp, each in its own cycle — so the result commits to the accumulator four cycles after the command is accepted. (The multiply is split into two parallel nibble-products because a full 8×8 multiply alone is too long a path for 50 MHz on the 180 nm node.)
  • mac_fsm — a 2-state controller that converts each rising edge of strobe into a single-cycle execute pulse (one strobe = exactly one operation, no repeated accumulation while strobe is held high) and raises done once the pipelined result has committed.
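To illustrate the split multiply, the sketch below rebuilds a signed 8×8 product from two narrower partial products. Which operand is split, and how the sign is handled, are assumptions made for this example (the activation is split into a signed high nibble and an unsigned low nibble); the actual RTL may partition the multiply differently. The function name `split_multiply` is hypothetical.

```python
def split_multiply(weight, activation):
    """Signed 8x8 product rebuilt from two 8x4 partial products (illustrative)."""
    hi = activation >> 4        # signed high nibble, -8..7 (arithmetic shift)
    lo = activation & 0xF       # unsigned low nibble, 0..15
    p_hi = weight * hi          # first partial product
    p_lo = weight * lo          # second partial product, computed in parallel in hardware
    return (p_hi << 4) + p_lo   # reconstruct: 16 * p_hi + p_lo

# Exhaustive check against the direct product over all INT8 operand pairs.
mismatches = sum(
    split_multiply(w, a) != w * a
    for w in range(-128, 128)
    for a in range(-128, 128)
)
print(mismatches)  # 0
```

The identity holds because any integer `a` equals `16 * (a >> 4) + (a & 0xF)` when the shift is arithmetic.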

Everything is fully synchronous to clk, single clock domain, active-low reset. The pipeline keeps each stage's logic to a single operation so the design closes timing at the 50 MHz tile target on the 180 nm GF180 process (a single-cycle multiply-and-accumulate does not — see the project notes). Per-operation latency is hidden behind done, so the host protocol is unchanged.
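The one-strobe-one-operation behaviour of mac_fsm can be illustrated with a cycle-level model of rising-edge detection. This is a behavioural sketch only: the function name `execute_pulses` is an assumption, and the model does not attempt to reproduce the timing of done.

```python
def execute_pulses(strobe_samples):
    """Return a per-cycle list that is 1 only where strobe goes from 0 to 1."""
    prev = 0
    pulses = []
    for strobe in strobe_samples:
        pulses.append(1 if (strobe and not prev) else 0)
        prev = strobe           # remember the last sampled level
    return pulses

# Strobe held high for three cycles still yields a single execute pulse.
print(execute_pulses([0, 1, 1, 1, 0, 0, 1, 0]))  # [0, 1, 0, 0, 0, 0, 1, 0]
```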

Pin map

| Pins | Dir | Name | Meaning |
|---|---|---|---|
| ui_in[7:0] | in | data | signed INT8 operand (weight or activation) |
| uio_in[1:0] | in | cmd | 00 NOP · 01 load weight · 10 MAC · 11 clear |
| uio_in[2] | in | strobe | rising edge executes one command |
| uio_in[4:3] | in | rd_sel | which accumulator byte appears on uo_out (0=LSB … 3=MSB) |
| uo_out[7:0] | out | acc_byte | selected accumulator byte |
| uio_out[5] | out | done | one-cycle completion pulse |
| uio_out[6] | out | ovf | sticky saturation flag |

How to test

Each operation is a three-step handshake from the Commander:

  1. Drive ui_in (data) and uio_in[1:0] (command), with strobe low.
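  2. Raise strobe (uio_in[2]); the engine accepts the command on the rising edge and pulses done (uio_out[5]) for one cycle once the operation has committed. A MAC commits four cycles after it is accepted, so wait for done rather than counting a fixed delay.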
  3. Lower strobe to re-arm for the next operation.
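The three steps above can be sketched as the sequence of pin values a host would write. The bit positions come from the pin map; the helper names (`pack_uio_in`, `handshake_writes`, `done_and_ovf`) are hypothetical, and how the values reach the pins (for example via the demo board firmware) is deliberately left out.

```python
CMD_NOP, CMD_LOAD_WEIGHT, CMD_MAC, CMD_CLEAR = 0b00, 0b01, 0b10, 0b11

def pack_uio_in(cmd, strobe=0, rd_sel=0):
    """Pack cmd (bits 1:0), strobe (bit 2) and rd_sel (bits 4:3) into uio_in."""
    return (cmd & 0b11) | ((strobe & 1) << 2) | ((rd_sel & 0b11) << 3)

def handshake_writes(cmd, data=0):
    """Return the (ui_in, uio_in) values for the three handshake steps."""
    ui_in = data & 0xFF                          # signed INT8 as a two's-complement byte
    return [
        (ui_in, pack_uio_in(cmd, strobe=0)),     # 1. drive data and command, strobe low
        (ui_in, pack_uio_in(cmd, strobe=1)),     # 2. raise strobe, then wait for done
        (ui_in, pack_uio_in(cmd, strobe=0)),     # 3. lower strobe to re-arm
    ]

def done_and_ovf(uio_out):
    """Extract the done (bit 5) and ovf (bit 6) flags from a uio_out sample."""
    return bool((uio_out >> 5) & 1), bool((uio_out >> 6) & 1)

print(handshake_writes(CMD_MAC, -3))  # [(253, 2), (253, 6), (253, 2)]
```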

To read the 32-bit result, set rd_sel (uio_in[4:3]) to 0,1,2,3 in turn and read uo_out each time; concatenate as little-endian to recover the signed INT32 accumulator.
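A minimal sketch of the readout reassembly, assuming a caller-supplied `read_byte(rd_sel)` function that sets rd_sel and returns the value seen on uo_out. Both `assemble_acc` and the `fake_reader` stand-in are hypothetical names used only for this illustration.

```python
def assemble_acc(read_byte):
    """Read bytes 0..3 (LSB first) and return the signed INT32 accumulator."""
    raw = bytes(read_byte(sel) for sel in range(4))
    return int.from_bytes(raw, "little", signed=True)

def fake_reader(value):
    """Stand-in for the hardware: serves the bytes of a known INT32 value."""
    raw = value.to_bytes(4, "little", signed=True)
    return lambda sel: raw[sel]

print(assemble_acc(fake_reader(30)))   # 30  (bytes 1E 00 00 00)
print(assemble_acc(fake_reader(-1)))   # -1  (bytes FF FF FF FF)
```

Passing `signed=True` is what turns a most-significant byte of 0x80 or above into a negative result.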

Worked example (compute 3·5 + 3·5 = 30):

| Step | cmd | data | result |
|---|---|---|---|
| load weight | 01 | 3 | weight = 3 |
| MAC | 10 | 5 | acc = 15 |
| MAC | 10 | 5 | acc = 30 |
| read | 00 | rd_sel 0→3 | 1E 00 00 00 → 30 |

The cocotb suite in test/ drives exactly this protocol and checks every result against a Python golden model, including signed operands, clear, byte-streamed readout, a 150-step randomized sequence, and both saturation directions (the exhaustive saturation tests are gated behind SKIP_SLOW=1).

Vectors can be generated with tools/export_vectors.py, which can also pull a real INT8 weight/activation row from a quantized detector layer so the silicon is exercised with mission-representative data.

External hardware

None. The design is driven entirely from the Tiny Tapeout demo board (RP2040 + Commander); no PMOD or external parts required.

IO

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | data[0] (signed INT8 in) | acc_byte[0] (selected accumulator byte) | cmd[0] (in) |
| 1 | data[1] | acc_byte[1] | cmd[1] (in) |
| 2 | data[2] | acc_byte[2] | strobe (in) |
| 3 | data[3] | acc_byte[3] | rd_sel[0] (in) |
| 4 | data[4] | acc_byte[4] | rd_sel[1] (in) |
| 5 | data[5] | acc_byte[5] | done (out) |
| 6 | data[6] | acc_byte[6] | ovf (out) |
| 7 | data[7] | acc_byte[7] | |
