119 AI Decelerator

119 : AI Decelerator

  • Author: machinaut
  • Description: Systolic array for matrix multiplication
  • GitHub repository
  • Clock: 50000000 Hz

How it works

Implements a 2x2 outer product, such that it can stream accumulate a product of a 2xN and Nx2 matrix.

It's pipelined to run 4-clock-cycle blocks, and has a single 4-stage floating point multiply-accumulate.

Chip diagram:

This is designed to be placed in a grid of tiles, with each element having 4+1 wires connecting each side.

Columns move data from north to south (inputs to outputs), and rows move data west to east (inputs to outputs).

The wires between each tile is 4 data wires and 1 control wire, and are all 1-directional.

Shared clock and reset are not shown, but are assumed to be connected to all tiles.

(See inputs/outputs at the bottom for pin assignment)

                   col_in[4]  
                   |   col_ctrl_in
                   |   |
                   v   v
                 |--------|
    row_in[4] -> |        | -> row_out[4]
  row_ctrl_in -> |        | -> row_ctrl_out
                 |--------|
                   |   |
                   |   v
                   v   col_ctrl_out
                   col_out[4]

Block timing diagram:

We have 4 clock cycles per block, which advances at the positive edge of each clock.

We'll use "block" to denote the whole block, and "count" to count the cycles within a block.

Note the control inputs are always passed through exactly from inputs to outputs on the next block. However data outputs might be different from data inputs (see modes below).

Key: ci = col_in, ri = row_in, cc = col_ctrl, rc = row_ctrl, co = col_out, ro = row_out

       Block | 0       | 1       | 2       | ...
       Count | 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | ...
-------------|---------|---------|---------|-----
      col_in | blk0_ci | blk1_ci | blk2_ci | ...
 col_ctrl_in | blk0_cc | blk1_cc | blk2_cc | ...
      row_in | blk0_ri | blk1_ri | blk2_ri | ...
 row_ctrl_in | blk0_rc | blk1_rc | blk2_rc | ...
-------------|---------|---------|---------|-----
     col_out |         | blk0_co | blk1_co | ...
col_ctrl_out |         | blk0_cc | blk1_cc | ...
     row_out |         | blk0_ro | blk1_ro | ...
row_ctrl_out |         | blk0_rc | blk1_rc | ...

Modes:

The shared control bits (from both row and column) decide what to do with the data in the block.

col_ctrl | row_ctrl | mode
---------|----------|------------
0000     | 0000     | passthrough
0WX0     | 1YZ0     | multiply-accumulate
1000     | 0100     | read-write accumulator 0
1100     | 0000     | read-write accumulator 1

Passthrough Mode will pass through data unchanged (current block data will be sent out as the next block).

Multiply-accumulate Mode will interpret the input data as FP8 vectors, and multiply them and accumulate them (see math below). This mode will also pass through the data unchanged (current block data will be sent out as the next block). W, X, Y, Z specify the FP8 format for the inputs (see below).

Read-write accumulator modes will shift input with the accumulator data and output data. This is used to simultaneously read-out the current accumulator state, and write-in the next accumulator state.

RW mode | cci  | rci  | ci      | ri      | co (next) | ro (next)
--------|------|------|---------|---------|-----------|---------------
      0 | 0100 | 1000 | C0 (new)| C1 (new)| C0 (prev) | C1 (prev)
      1 | 0000 | 1100 | C2 (new)| C3 (new)| C2 (prev) | C3 (prev)

Multiply-Accumulate Math:

We interpret the column data as vector A0, A1, and the bits of the control input specify the FP8 format of A0, A1. Ditto for row data and B0, B1. The format bit is 0 if the value is E5M2 and 1 if the value is E4M4. See this paper for details on the FP8 formats: https://arxiv.org/pdf/2209.05433.pdf

The accumulators (C0, C1, C2, C3) are all FP16.

           A0      A1
           |       |
           v       v
        |-------|-------|
  B0 -> |  C0   |  C1   | -> B0
        |-------|-------|  (prev)
  B1 -> |  C2   |  C3   | -> B1
        |-------|-------|
           |         |
           v         v
           A0 (prev) A1

Systolic Tiling:

Each block controls what should happen in the following block.

Notionally, this could be used in a systolic tile pattern of N * M tiles, moving data along columns and rows. This hasn't been tested. Note that this still works with reading and writing accumulators since all the values are shifted block by block along the columns and rows.

How to test

I have no idea what clock speeds are safe for this, so probably start out slow and work your way up until there are glitches. (This is like only 4 pipeline stages for a full multiply-accumulate, so it has some nasty propagation chains)

To compute A * B + C, where A is a 2xN matrix, B is a Nx2 matrix, and C is a 2x2 matrix, do the following: (A and B can be mixtures of FP8 formats, and C is FP16)

Use the read-write accumulator mode to write in C over two blocks (skip this if you want to start with C = 0)

Block 0:
    col_in: C_0,0 (FP16)
    row_in: C_0,1 (FP16)
    col_ctrl_in: b1000
    row_ctrl_in: b0100
Block 1:
    col_in: C_1,0 (FP16)
    row_in: C_1,1 (FP16)
    col_ctrl_in: b1100
    row_ctrl_in: b0000

Then use the multiply-accumulate mode for N blocks

Block K:
    col_in: A_0,K A_1,K (FP8, FP8)
    row_in: B_K,0 B_K,1 (FP8, FP8)
    col_ctrl_in: b0WX0 (where W, X are FP8 format bits for A0, A1)
    row_ctrl_in: b1YZ0 (where Y, Z are FP8 format bits for B0, B1)

Finally read out the result from the accumulator, just like you wrote it in

Block 0:  (Note we care about the outputs here)
    col_ctrl_in: b1000
    row_ctrl_in: b0100
    col_out: C_0,0 (FP16)
    row_out: C_0,1 (FP16)
Block 1:  (Note we care about the outputs here)
    col_ctrl_in: b1100
    row_ctrl_in: b0000
    col_out: C_1,0 (FP16)
    row_out: C_1,1 (FP16)

Example:

I have a 2xK matrix A, with values A00 through A1K, and a Kx2 matrix B, with values BK0 through BK1, and a 2x2 matrix C with values C00 through C11. We want to compute A * B + C = D where D is a 2x2 matrix with values D00 through D11.

The basic steps are

  • Write in the initial value of C
  • Stream in the values of A and B, and multiply-accumulate
  • Read out the accumulated result, and call it D

For simplicity, we'll assume all the A and B values are E5M2 format, but remember they can be configured per-value.

C and D are both in FP16 format.

Remember that each block is four clock cycles, and each clock cycle 1/4 of the inputs and outputs are transmitted.

Block | col_in | row_in | cci  | rci  | col_out | row_out
------|--------|--------|------|------|---------|---------
0     | C00    | C01    | 1000 | 0100 |         |
1     | C10    | C11    | 1100 | 0000 |         |
2     | A00A10 | B00B01 | 0000 | 1000 |         |
3     | A01A11 | B10B11 | 0000 | 1000 |         |
...   | ...    | ...    | ...  | ...  |         |
K + 2 | A0KA1K | BK0BK1 | 0000 | 1000 |         |
K + 3 |        |        | 1000 | 0100 | D00     | D01
K + 4 |        |        | 1100 | 0000 | D10     | D11

So we can compute this in just K + 4 blocks, where K is the inner size of the A and B matrix product.

(Also we can save 2 blocks if C is zero, since that's the reset value of the accumulators)

IO

#InputOutputBidirectional
0Row Data In 0Row Data Out 0Row Control Out
1Row Data In 1Row Data Out 1Col Control Out
2Row Data In 2Row Data Out 2Row Control In
3Row Data In 3Row Data Out 3Col Control In
4Column Data In 0Column Data Out 0
5Column Data In 1Column Data Out 1
6Column Data In 2Column Data Out 2
7Column Data In 3Column Data Out 3

Chip location

Controller Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux tt_um_chip_rom (Chip ROM) tt_um_factory_test (TinyTapeout 04 Factory Test) tt_um_vga_clock (VGA clock) tt_um_seven_segment_seconds (7 segment seconds) tt_um_marno_factorize (Number Factorizer) tt_um_sorter (Odd even sorter) tt_um_bulls_and_cows (The Bulls and Cows game) tt_um_loopback (TinyTapeout 04 Loopback Test Module) tt_um_devinatkin_arduino_vga (VGA Output for Arduino) tt_um_wokwi_371604537887211521 (Digital Cipher & Interlock System) tt_um_urish_simon (Simon Says game) tt_um_wokwi_372184284115580929 (YKM 7-seg driver) tt_um_currymottled (Configurable PID Block) tt_um_yeokm1_pwm_audio (PWM audio) tt_um_dcb277_ALU (4-bit ALU) tt_um_rgb_mixer (RGB Mixer) tt_um_loopback (TinyTapeout 04 Loopback Test Module) tt_um_algofoogle_raybox_zero (raybox-zero) tt_um_morningjava_top (ChipTune) tt_um_loopback (TinyTapeout 04 Loopback Test Module) tt_um_eldritch_pwm_peripheral (OpenSource PWM Peripheral) tt_um_phansel_laplace_lut (Experiment Number Six: Laplace LUT) tt_um_ks_pyamnihc (Karplus-Strong String Synthesis) tt_um_dlmiles_tt04_poc_usbdev (USB Device) tt_um_thorkn_pwmaudio (Audio-PWM-Synth) tt_um_wokwi_374029622762967041 (German Traffic Light) tt_um_dandy_dance (Dandy VGA) tt_um_robojan_top (Tiny Breakout) tt_um_vc_cpu (VC 16-bit CPU) tt_um_MichaelBell_nanoV (Risc-V Nano V) tt_um_urish_usb_cdc (USB CDC (Serial)) tt_um_tiny_processor (Tiny processor) tt_um_f_hal_fft (fft-4-tt) tt_um_tomkeddie_a (LED Panel Driver) tt_um_wokwi_370722051572189185 (OSU Counter) tt_um_wokwi_370796071922577409 (Even digits) tt_um_wokwi_370709383347782657 (Traffic light) tt_um_wokwi_371425977920989185 (Tutorial4) tt_um_riceshelley_tinyFPGA (Grain-Flex-FPGA) tt_um_mgyenik_bfcpu (BFCPU) tt_um_machinaut_systolic (AI Decelerator) tt_um_wokwi_374140166551523329 (Tiny (3-bit) LFSR) tt_um_jk2102 (Pulsed Plasma Thruster (PPT) Controller) tt_um_jayraj4021_SAP1_cpu (SAP-1 CPU) tt_um_wokwi_370533670565165057 (Impulse counter) tt_um_ashleyjr_delay_line (Delay Line) tt_um_simplepiano (Simple Piano) tt_um_wokwi_374292646686728193 (Ripple-Carry Adder) tt_um_led_multiplexer_display (Led Multiplexer Display) tt_um_mjbella_led_matrix_driver (LED Matrix Driver) tt_um_fifo_stevej (8-bit FIFO with depth 16.) tt_um_robojan_pong_top (Pong) tt_um_wokwi_374962052813090817 (8 panel display"") tt_um_wokwi_370690644715216897 (Traffic Light) tt_um_wokwi_374494377414857729 (Model Railway turntable polarity controller) tt_um_wokwi_347144898258928211 (Customizable UART string tx) tt_um_wokwi_347497504164545108 (7-Seg 'Tiny Tapeout' Display) tt_um_wokwi_347140425276981843 (UART character tx) tt_um_wokwi_347417602591556180 (Padlock) tt_um_noritsuna_8bitcounter_AI (8bits Counter by AI) tt_um_fm_transmitter (FM Transmitter) tt_um_wokwi_369864099838656513 (Test 4x4 memory) tt_um_htfab_rotfpga2 (ROTFPGA v2) tt_um_losaias (Arithmetic logic unit of four operations between two 8-bit numbers) tt_um_fir_top (FIR Filter) tt_um_santacrc_tamagotchi (Tamagotchi) tt_um_4_LUT_Baungarten (LFMPDM (Lightning Fast Matrix Programmable Design Module)) tt_um_RELOG_10M_Juan_Garcial (7 SEGMENTS CLOCK) tt_um_MultiPatternLEDSequencer_RSYO3000 (Multi Pattern LED Sequencer) tt_um_pwm (Generador de PWM) tt_um_chip_SP_measure_delay (Multi stage path for delay measurements.) tt_um_chip_SP_Soy_de_Zacapa (ASCII Text Printer Circuit) tt_um_fing_synchronizer_hga (Clock synchronizer) tt_um_db_PWM (Simple PWM Generator) tt_um_RS_Vfreq (CLK Frequency Divider) tt_um_ja_TrafficLight (UIS Traffic Light) tt_um_adder_NestorMatajira (4 bit adder ) tt_um_ALU_NicolasOrcasitas (8-bit ALU) tt_um_ccollatz_SergioOliveros (Collatz Conjecture) tt_um_wokwi_370011087462055937 (8 bit 4 data sorting network) tt_um_wokwi_374968111036708865 (BCD to 7 segments) tt_um_wokwi_374967675785369601 (4 bit full adder) tt_um_wokwi_374974793636964353 (Circuito Religioso) tt_um_wokwi_374815911155542017 (Demultiplexor NAND) tt_um_wokwi_374903567624066049 (Sumador/Sustractor de 3 bit con acarreo y prestamo) tt_um_wokwi_374515580784897025 (Hardware Lock) tt_um_wokwi_374909346558831617 (Custom falling and rising edge detection) tt_um_alu (4-bit-alu) tt_um_pong_neopixel (Angardo's pong) tt_um_LEOGLM_hamming_code_top ((11,7) hamming code encoder and decoder with UART) tt_um_adriannovosel_top (Impulse counter) tt_um_wokwi_374969806854695937 (State machine of an impulse counter) tt_um_wokwi_375061599421794305 (Logic Circuit 1) tt_um_biased_trng (Variable Duty-Cycle TRNG) tt_um_top (Modem Multimodo) tt_um_wokwi_372347167704674305 (SAR ADC Backend) tt_um_wokwi_375176944142127105 (FCFM 7-segment display) tt_um_rodrigomunoz1_rotempsensor_top (another ring oscillator based temperature sensor) tt_um_USM_temp_sens_hyst (RO-based temperature sensor with hysteresis) tt_um_FSM (Microrobotics FSM) tt_um_wokwi_375042398768251905 (MINI ALU) tt_um_wokwi_374778387606763521 (PWM Quisquilloso) tt_um_CPU (CPU 8 bit) tt_um_simple_processor_pablopabota (A Risc-V Instruction memory i2c programmer) tt_um_wokwi_375246321309880321 (IFSC 6-bit Locker) tt_um_wokwi_375217288209912833 (Randomizer and status checker) tt_um_wokwi_375245713375900673 (Simulador de cruzamento de semáforo) tt_um_wokwi_374636462642973697 (Full_adder_carry_juang_garzons) tt_um_ternaryPC_radixconvert (4-trit balanced ternary program counter and convertor) tt_um_darkfsegura_collatz (uDATAPATH_Collatz) tt_um_wokwi_375174630101280769 (Adder) tt_um_wokwi_375163050120587265 (Binary to 7 segment) tt_um_wokwi_375165100039571457 (Neuron) tt_um_silva (Later) tt_um_shift (serializer) tt_um_alu4_alonso59 (4-bits 1-channel PWM and ALU 4 bits) tt_um_mod_u_cnt_BCD (up-down counter with parallel load and BCD output) tt_um_ciro (Later) tt_um_control (Contador con carga) tt_um_onehot (onehot_decoder) tt_um_santiago (CDMA Transmitter/Receiver) tt_um_divider_urielcho (clock divider) tt_um_pwl_RaulprTech (reciprocal) tt_um_fabian (Later) tt_um_thezoq2_tmng (Time Multiplexed Nand-gate) tt_um_uninorte (Octal classifier) tt_um_dlmiles_muldiv4 (MULDIV unit (4-bit signed/unsigned)) tt_um_rs_write_decodifier_fjrn_cinvestav (RS Write Decodifier) tt_um_BounceFSM_RSX92 (Password FSM) tt_um_priority_decoder_Juan_Garcial (Priority e) tt_um_FreqMeter_Juan_Garcial (frecuencimeter) tt_um_sahrdayalfsr (lfsr random number generator) tt_um_i2c (i2c_6 bits) tt_um_wokwi_375227079413963777 (Fastest Finger) tt_um_wokwi_375300958229329921 (Fastest Finger (Clocked)) tt_um_zeptobars (Oscillators II) tt_um_rebot449_lingret_ALU_Top (Simple ALU) tt_um_loopback (TinyTapeout 04 Loopback Test Module) tt_um_wokwi_375288605206694913 (Adjustable Frequency LED Chaser) tt_um_wokwi_375310871188385793 (Simple QSPI DAC) tt_um_quardinlyttle_top (AQALU) tt_um_wokwi_375326293008530433 (Simple TMR) tt_um_chiplet_jtag (Poor Person's Boundary Scan) tt_um_wokwi_375248885704300545 (Probador de lógica básico) tt_um_rejunity_telluride2023_neuron (LIF Neuron, Telluride 2023) tt_um_kpwebb_adder (rusty_adder) Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available Available