642 8-bit Vector Compute-in-SRAM

How it Works

This design is a vector multiplier with stationary weights, implemented on 4 tiles in Tiny Tapeout. The entire design, tests, and documentation took around 10 hours. It contains 8 multiply-and-add (MAC) units, each equipped with two registers, and an adder tree that sums the 8 products of an 8-element vector without loss of precision. Weights and activations are programmed separately into each MAC unit using specific operation codes (OPs). The weights remain stationary: they are programmed once and reused for multiple activation inputs.
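A minimal behavioural sketch of the weight-stationary scheme described above, written in plain Python rather than Verilog. The class name `VectorMacModel` and its methods are illustrative and not taken from the design sources, and unsigned 8-bit operands are assumed.

```python
class VectorMacModel:
    """Behavioural model: 8 MAC units, each with a weight and an activation register."""

    NUM_MACS = 8

    def __init__(self):
        self.weights = [0] * self.NUM_MACS
        self.activations = [0] * self.NUM_MACS

    def load_w(self, address, value):
        # Assumption: operands are unsigned 8-bit values.
        self.weights[address] = value & 0xFF

    def load_a(self, address, value):
        self.activations[address] = value & 0xFF

    def result(self):
        # Sum of the 8 products, as the adder tree would produce it.
        return sum(w * a for w, a in zip(self.weights, self.activations))


model = VectorMacModel()
for i, w in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    model.load_w(i, w)  # weights are programmed once

outputs = []
for acts in ([1] * 8, [2] * 8):
    for i, a in enumerate(acts):
        model.load_a(i, a)  # only the activations change between runs
    outputs.append(model.result())

print(outputs)  # [36, 72]
```

The second result reuses the weights from the first run, which is the point of keeping them stationary.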

Components and Operation

  1. Multiply-and-Add (MAC) Units:

    • Each MAC unit consists of two registers: one for weights (W) and one for activations (A). The MAC unit performs the multiplication of these two values and stores the result.
    • The weights and activations are loaded into the MAC units using distinct OP codes (LOAD_W and LOAD_A). The LOAD_W OP code (0b00) loads the weight into the MAC unit, while the LOAD_A OP code (0b01) loads the activation value.
    • After the weights and activations are loaded, the MAC unit multiplies these values and stores the result for further processing by the adder tree.
  2. Adder Tree:

    • The adder tree is responsible for summing the outputs of the 8 MAC units. It ensures that the multiplication results are accumulated accurately without precision loss.
    • The adder tree is structured hierarchically in three levels to sum the results efficiently.
      • Level 1: Adds pairs of MAC outputs.
      • Level 2: Adds the results of Level 1.
      • Level 3: Adds the results of Level 2 to produce the final sum.
    • This hierarchical structure ensures that the final sum of all MAC outputs is computed correctly and efficiently.
  3. Readout Mechanism:

    • Due to the limited width of the output interface (8 bits), the readout mechanism reads the final result of the adder tree (s_adder_tree) over multiple clock cycles.
    • The READ_S OP code (0b10) is used to initiate the readout process. The final sum is split into three 8-bit chunks, which are read sequentially.
    • The chunks are read most significant byte first and least significant byte last. Together the three bytes (24 bits) carry the complete 19-bit result.
  4. Programming and Operation:

    • Loading Weights and Activations: The design supports separate loading of weights and activations. Each MAC unit is addressed individually, and the corresponding values are loaded using the appropriate OP codes.
    • Vector Multiplication: Once the weights and activations are loaded, the MAC units perform the multiplication, and the results are summed by the adder tree.
    • Result Readout: The final sum is read out using the READ_S OP code, ensuring the complete result is available for further processing or verification.
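The three-level reduction and the 19-bit result width can be checked with a short Python sketch. It assumes unsigned 8-bit operands, so that each product fits in 16 bits; the function name `adder_tree` is illustrative.

```python
def adder_tree(products):
    """Sum 8 values pairwise in three levels, returning every level's outputs."""
    assert len(products) == 8
    levels = []
    current = list(products)
    while len(current) > 1:
        current = [current[i] + current[i + 1] for i in range(0, len(current), 2)]
        levels.append(current)
    return levels


# Worst case under the unsigned assumption: every weight and activation is 255.
products = [255 * 255] * 8
level1, level2, level3 = adder_tree(products)

print(len(level1), len(level2), len(level3))  # 4 2 1
print(level3[0], level3[0].bit_length())      # 520200 19
```

Each level adds one bit of headroom (16 to 17 to 18 to 19 bits), which is why the sum can be formed without truncation.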

This design handles the multiplication and accumulation of 8-element vectors without loss of precision despite the limited area available (4 tiles). Its simple opcode interface for programming operands and reading results makes it suitable for applications that require vector multiplications.

How to Test

Several tests in test/test.py help explain how the design works. Because only the external signals are accessible in the final design, most of the tests are commented out. Each test also contains its own commented-out Verilog code, since some tests were originally written to verify individual units such as the MACs and adders. There are four categories of tests:

  • Single MAC operation test: This test verifies the basic functionality of a single MAC unit, ensuring that it correctly performs multiplication and stores the result.
  • Multiple MAC units loading weights and activations: This test checks that multiple MAC units can load weights and activations correctly and simultaneously. It ensures that each MAC unit receives and processes its assigned data independently of the others.
  • Adder tree tests: These tests validate the correctness of the adder tree at all levels. They ensure that the outputs from the MAC units are correctly summed through the hierarchical adder tree structure, producing accurate intermediate and final sums.
  • Read result tests: These tests focus on the readout circuit, verifying that the s_adder_tree result can be correctly read out in multiple 8-bit chunks. This ensures that the readout mechanism accurately reconstructs the full result over several clock cycles.
    • Read only with external signals: These tests exercise the entire design using only the external signals. This is the only test left uncommented in the final version.

The last three tests work on several test vectors to ensure correct operation with various numbers. These test vectors include a wide range of values and scenarios to thoroughly exercise the design and confirm its correctness under different conditions.
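A sketch of how such test vectors and their expected sums might be generated. The corner cases and the seed below are illustrative choices, not the vectors used in test/test.py, and unsigned 8-bit operands are assumed.

```python
import random


def expected_sum(weights, activations):
    """Reference result: dot product of the two 8-element vectors."""
    return sum(w * a for w, a in zip(weights, activations))


def make_test_vectors(num_random=4, seed=0):
    rng = random.Random(seed)
    vectors = [
        ([0] * 8, [0] * 8),         # all zeros
        ([255] * 8, [255] * 8),     # largest possible sum
        ([1] * 8, list(range(8))),  # easy to check by hand
    ]
    for _ in range(num_random):
        vectors.append(([rng.randrange(256) for _ in range(8)],
                        [rng.randrange(256) for _ in range(8)]))
    return vectors


for weights, activations in make_test_vectors()[:3]:
    print(expected_sum(weights, activations))
# 0
# 520200
# 28
```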

OP 00: LOAD_W

The LOAD_W function is an integral part of the testbench, designed to load weight values into the MAC units. This function is invoked by setting the LOAD_W opcode, which corresponds to the value 0b00 << 6. The opcode is combined with the MAC address to target a specific MAC unit for the weight load operation. The MAC address is specified in the lower bits of the ui_in signal, allowing precise selection of the MAC unit.

The process begins by setting the ui_in input to the LOAD_W opcode combined with the target MAC address. This action signals the DUT to prepare for loading the weight value into the specified MAC unit. Simultaneously, the weight value is provided via the uio_in input. To ensure that the command and data are properly registered and processed, the function waits for one clock cycle.

By following these steps, the LOAD_W function effectively communicates with the DUT to load weight values into the desired MAC units. This operation is crucial for initializing the MAC units with the appropriate weights for subsequent computations.


OP 00: LOAD_W Code Snippet for Reference:

from cocotb.triggers import ClockCycles

async def write_weight(mac_address, weight):
    # `dut` is assumed to be the cocotb design handle from the enclosing test
    # Set the op code to 00 (write weight) and address
    dut.ui_in.value = (0b00 << 6) | mac_address
    # Set the weight data
    dut.uio_in.value = weight
    # Wait for a clock cycle so the DUT registers the write
    await ClockCycles(dut.clk, 1)
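The opcode-plus-address packing used by `write_weight` can be factored into a small helper. This is only a sketch: `encode_command` and the constant names are illustrative, and the 6-bit address field follows the IO table (bits 0 to 5 for the address, bits 6 and 7 for the opcode).

```python
LOAD_W = 0b00
LOAD_A = 0b01
READ_S = 0b10


def encode_command(opcode, mac_address=0):
    """Pack a 2-bit opcode and a 6-bit address into one ui_in byte."""
    if not 0 <= opcode <= 0b11:
        raise ValueError("opcode must fit in 2 bits")
    if not 0 <= mac_address < 64:
        raise ValueError("address must fit in 6 bits")
    return (opcode << 6) | mac_address


print(bin(encode_command(LOAD_W, 3)))  # 0b11
print(bin(encode_command(LOAD_A, 3)))  # 0b1000011
print(bin(encode_command(READ_S)))     # 0b10000000
```

With 8 MAC units, only addresses 0 to 7 select a unit, even though the field is 6 bits wide.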

OP 01: LOAD_A

The LOAD_A function is another essential component of the testbench, designed to load activation values into the MAC units. This function is activated by setting the LOAD_A opcode, which corresponds to the value 0b01 << 6. Similar to LOAD_W, the opcode is combined with the MAC address to target a specific MAC unit for the activation load operation.

The process starts by setting the ui_in input to the LOAD_A opcode combined with the target MAC address. This action instructs the DUT to prepare for loading the activation value into the specified MAC unit. Concurrently, the activation value is provided via the uio_in input. To ensure the command and data are correctly registered and processed, the function waits for one clock cycle.

By adhering to these steps, the LOAD_A function successfully communicates with the DUT to load activation values into the designated MAC units. This operation is vital for initializing the MAC units with the appropriate activation values, enabling accurate computation during the subsequent processing stages.


OP 01: LOAD_A Code Snippet for Reference:

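from cocotb.triggers import ClockCycles

async def write_act(mac_address, a_value):
    # `dut` is assumed to be the cocotb design handle from the enclosing test
    # Set the op code to 01 (write activation value) and address
    dut.ui_in.value = (0b01 << 6) | mac_address
    # Set the activation data
    dut.uio_in.value = a_value
    # Wait for a clock cycle so the DUT registers the write
    await ClockCycles(dut.clk, 1)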
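Loading a full 8-element vector amounts to one such write per MAC unit. The sketch below only builds the list of `(ui_in, uio_in)` pairs that a test would drive, one pair per clock cycle; it does not talk to a simulator, and `build_load_sequence` is an illustrative name.

```python
def build_load_sequence(opcode, values):
    """Return (ui_in, uio_in) pairs that load `values` into MAC units 0..7."""
    if len(values) != 8:
        raise ValueError("expected one value per MAC unit")
    return [((opcode << 6) | address, value & 0xFF)
            for address, value in enumerate(values)]


LOAD_W, LOAD_A = 0b00, 0b01
weight_cmds = build_load_sequence(LOAD_W, [10, 20, 30, 40, 50, 60, 70, 80])
act_cmds = build_load_sequence(LOAD_A, [1, 0, 1, 0, 1, 0, 1, 0])

print(weight_cmds[:2])  # [(0, 10), (1, 20)]
print(act_cmds[:2])     # [(64, 1), (65, 0)]
```

In a cocotb test, each pair would correspond to one call of `write_weight` or `write_act`.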

OP 10: read_s

The read_s function is a critical part of the testbench designed to read the final result from the adder tree, known as s_adder_tree. Since the output interface of the system can only handle 8 bits at a time, the function retrieves the complete result over multiple clock cycles. The process begins by initializing the READ_S command. This is achieved by setting the ui_in input to the value corresponding to the READ_S opcode (0b10 << 6). This command instructs the system to prepare the s_adder_tree result for reading. To ensure the command is properly registered and processed by the Device Under Test (DUT), the function waits for one clock cycle.

Following the initialization of the READ_S command, the function sets the ui_in input to a non-operational value (0b11 << 6). This step ensures that the command remains stable and does not interfere with the readout process. Another clock cycle wait is introduced to guarantee that the data is ready to be read.

The core of the read_s function involves reading the s_adder_tree result in 8-bit chunks. The function initializes a variable, result, to store the combined output. It then enters a loop that iterates three times, corresponding to the three 8-bit chunks (24 bits in total) needed to carry the 19-bit result. During each iteration, the function waits for the rising edge of the clock to synchronize with the DUT’s data output. This synchronization is crucial for accurate data retrieval. The function reads the current 8-bit chunk from the uo_out output, shifts the previously read data left by 8 bits, and combines it with the new chunk using a bitwise OR operation. This method ensures that the first chunk read corresponds to the most significant bits (MSBs) and the last chunk read corresponds to the least significant bits (LSBs).

After all three chunks have been read and combined, the function returns the complete result, which represents the full 19-bit s_adder_tree value. This process highlights the importance of synchronization with the clock signal and handling data in multiple cycles due to the 8-bit limitation of the output interface. By following these steps, the read_s function effectively reads and reconstructs the adder tree result, ensuring accurate and reliable verification of the system’s computation.


OP 10: read_s Code Snippet for Reference:

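from cocotb.triggers import ClockCycles, RisingEdge

async def read_s():
    # `dut` is assumed to be the cocotb design handle from the enclosing test
    # Set the op code to 10 (read s_adder_tree)
    dut.ui_in.value = 0b10 << 6
    await ClockCycles(dut.clk, 1)

    # Set to non-operational value to avoid interference
    dut.ui_in.value = 0b11 << 6
    await ClockCycles(dut.clk, 1)

    # Read three 8-bit chunks, most significant first
    result = 0
    for _ in range(3):
        await RisingEdge(dut.clk)
        result = (result << 8) | int(dut.uo_out.value)
    return result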
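The byte reassembly in `read_s` can be exercised without a simulator. This sketch assumes the MSB-first order described above and that the 19-bit sum is zero-padded to 24 bits; `split_result` and `join_chunks` are illustrative names.

```python
def split_result(value):
    """Split a sum into three 8-bit chunks, most significant first."""
    return [(value >> shift) & 0xFF for shift in (16, 8, 0)]


def join_chunks(chunks):
    """Rebuild the value the same way read_s does: shift left by 8, then OR."""
    result = 0
    for chunk in chunks:
        result = (result << 8) | chunk
    return result


chunks = split_result(520200)    # largest sum for unsigned 8-bit operands
print([hex(c) for c in chunks])  # ['0x7', '0xf0', '0x8']
print(join_chunks(chunks))       # 520200
```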

External hardware

Currently, the compute-in-SRAM design does not interface with any external hardware components, and in practice it should not. It operates entirely within its own resources with its own defined set of control commands. Some external signals might be needed to orchestrate the operation of multiple units with a control processor.

IO

#  Input          Output          Bidirectional
0  Address bit 0  Data out bit 0  Data in bit 0
1  Address bit 1  Data out bit 1  Data in bit 1
2  Address bit 2  Data out bit 2  Data in bit 2
3  Address bit 3  Data out bit 3  Data in bit 3
4  Address bit 4  Data out bit 4  Data in bit 4
5  Address bit 5  Data out bit 5  Data in bit 5
6  Op Code bit 0  Data out bit 6  Data in bit 6
7  Op Code bit 1  Data out bit 7  Data in bit 7
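Going the other way, a `ui_in` byte can be decoded according to the table above. This is a sketch: `decode_ui_in` is an illustrative name, and the `NOP` label for opcode 0b11 is an assumption based on the non-operational value used in `read_s`.

```python
OPCODES = {0b00: "LOAD_W", 0b01: "LOAD_A", 0b10: "READ_S", 0b11: "NOP"}


def decode_ui_in(byte):
    """Split a ui_in byte into (opcode name, 6-bit address)."""
    opcode = (byte >> 6) & 0b11  # input bits 7..6
    address = byte & 0x3F        # input bits 5..0
    return OPCODES[opcode], address


print(decode_ui_in(0x43))  # ('LOAD_A', 3)
print(decode_ui_in(0x80))  # ('READ_S', 0)
```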

Chip location

[Chip map figure: position of tt_um_8bit_vector_compute_in_SRAM (8-bit Vector Compute-in-SRAM) among the other projects on the chip.]