330 Tiny Tapeout Tensor Processing Unit

330 : Tiny Tapeout Tensor Processing Unit

Design render

How it works

This project is a small-scale matrix multiplier inspired by the Tensor Processing Unit (TPU), an AI inference accelerator ASIC developed by Google.

It multiplies two 2x2 matrices with signed 8-bit (1 byte) elements to an output matrix with signed 16-bit (2-byte) elements. It does so in a systolic array circuit, where flow of data is facilitated through the connections between a grid of 4 Multiply-Add-Accumulate (MAC) Processing Elements (PEs).

To store inputs prior to computation, it contains 2 matrices in memory registers, which occupy a total of 8 bytes.

To orchestrate the flow of data between inputs, memory, and outputs, a control unit coordinates state transitions, loads, and stores automatically.

Finally, a feeder module interfaces with the matrix multiplier to schedule the inputs and outputs to and from the systolic array.

It is capable of running over 99.8 Million Operations Per Second when using a maximum throughput streamed processing pattern to multiply big matrices in 2x2 blocks.

System Architecture

Alt text

The Processing Element

Let's start from the most atomic element of the matrix multiplier unit (MMU): its processing element (PE). The value stored within each PE contributes an element to the output.

Signal Name Direction Width Description
clk input 1 The clock!
rst input 1 Reset
clear input 1 Clear PE
a_in input 8 Input value
a_out output 8 Pass-on of input
b_in input 8 Weight value
b_out output 8 Pass-on of weight
c_out output 16 Accumulation

Since each output element of a matrix multiplication is a sum of products, the PE's primary operation is a multiply-add-accumulate.

It will take input terms a_in and b_in, multiply them, and then add them to the accumulator value c_out. Due to the larger values induced by multiplication, the accumulator holds more bits.

Since adjacent PEs corresponding to adjacent elements of the output matrix need the same input and weight values, these input terms are sent to a_out and b_out respectively, which are connected to other PEs by the systolic array.

Once the multiplication is done, the control unit will want to clear the PEs so that they can reset the accumulation for the next matrix product, which is facilitated via the automatic clear signal.

On the other hand, it is non-ideal to reset the entire chip, as it wastes time (an entire clock cycle) and is unnecessary to reset other elements such as memory.

The output is 16 bits for the 8-bit inputs, to account for the property of multiplication.

The Systolic Array

Signal Name Direction Width Description
clk input 1 The clock!
rst input 1 Reset
clear input 1 Forwarded to PEs
activation input 1 Enables ReLU
a_data0 input 8 Input value of top-left PE
a_data1 input 8 Input value of bottom-left PE
b_data0 input 8 Input value of top-left PE
b_data1 input 8 Input value of top-right PE
c00 output 16 Top-left output value
c01 output 16 Top-right output value
c10 output 16 Bottom-left output value
c11 output 16 Bottom-right output value

The systolic array is a network, or grid, of PEs. In this 2x2 multiplier, the result is a 4-element square matrix, so there are 4 PEs.

Internally, the systolic array maintains internal registers and a matrix of accumulators that are read by and written into by the PEs.

This includes a 2x3 matrix for the input matrix values, a 3x2 matrix for the weight matrix values, and a 2x2 for the final output.

The extra values beyond 2x2 for input & weight matrix values are to allow the PEs at the edge of the grid to send their input/weight values to a register to a "garbage".

At each clock cycle, elements will flow between the PEs. The inputs will flow from "left to right", and the weights will flow from "top to bottom". To add new values, inputs have to be provided to the ports at the "left", and weights have to be provided to the ports at the "top".

Then, the PEs are instantiated using the Verilog compile-time construct of genvar, in which signals of the PE are connected to specified indices of the internal systolic array signals. Makes the code clean and easy to write!

Diagram of PE arrangement below:

Alt text

Unified Memory

The unified memory module (memory.v) is an on-chip store that holds both weight and input matrices for quick access to both values during computations.

Signal Name Direction Width Description
clk input 1 System clock
rst input 1 Active-high reset
load_en input 1 Enable signal for matrix loading
addr input 3 Memory address for matrix elements
in_data input 8 Current input matrix
weight[0,1,2,3] output 8 Weight matrix elements
input[0,1,2,3] output 8 Input matrix elements

The only source of the memory will come from ui_in ports directly. When load_en goes high, the current element (in_data) connected to all bits at the dedicated input ui_in is loaded into memory at its specified byte address, on the next rising edge.

Matrix elements are held internally within an 8x8 register (sram). In terms of outputs, sram[0..3] maps to weight[0..3], sram[4..7] maps to input[0..3]. An address (addr) is generated by the control unit to correctly load elements into the internal register.

The entire memory space of 8 bytes is visible at the output ports asynchronously.

Control Unit

The control unit (control_unit.v) serves as the central orchestrator for the entire TPU, coordinating the flow of data between memory, the systolic array, and output collection through a carefully designed finite state machine (FSM).

Signal Name Direction Width Description
clk input 1 System clock
rst input 1 Active-high reset
load_en input 1 Enable signal for matrix loading
mem_addr output 3 Memory address for matrix elements
mmu_en output 1 Enable signal for MMU operations
mmu_cycle output 3 Current cycle count for MMU timing
State Machine Architecture

The control unit implements a 3-state FSM that manages the complete matrix multiplication pipeline:

  1. S_IDLE (2'b00): The default waiting state where the system remains until a matrix multiplication operation is requested via the load_en signal.

  2. S_LOAD_MATS (2'b01): The matrix loading phase when 8 matrix elements (4 for each 2x2 matrix) are sequentially loaded into memory. Because the order is already assumed to be row-major order, left matrix first, then the address value tracked in mem_addr will only increment when load_en is asserted, from 0 up until 7 when it resets to get ready for future input matrices.

  3. S_MMU_FEED_COMPUTE_WB (2'b10): The computation and output phase, taking 8 cycles total when the systolic array performs the last few operations of matrix multiplication, and makes the 4 outputs available in 16 bits each, 1 every 2 cycles. At the same time, the chip is available for streamed processing so that 8 new elements, representing 2 new 2x2 matrices, can be input for the next round of outputs occurring right after the current round.

Orchestration Logic

The control unit coordinates several critical functions:

  • Memory Interface Management: Through the mem_addr output (control_unit.v:9), it generates sequential memory addresses (0-7) during the loading and streaming phase, ensuring matrix elements are stored in the correct memory locations for later retrieval.

  • MMU Timing Control: The mmu_cycle signal (control_unit.v:13) provides precise timing information to the MMU feeder module, enabling it to:

  • Feed correct matrix elements to the systolic array at appropriate cycles

  • Determine when computation results are ready for output

  • Clear processing elements after computation completion

  • Pipeline Coordination: The mmu_en signal (control_unit.v:12) acts as the master enable for the entire computation pipeline, transitioning from low during loading to high during computation phases. This is to ensure that elements are only loaded into the systolic array during the first round of set up inputs when all inputs are ready. Otherwise, if the chip is not initialized with all inputs in memory, it cannot complete computation and hence should not start it.

    • However, for maximum throughput, the mmu_en signal is asserted when 6 of the 8 elements making up 2 matrices are input, so that computation begins when we have the elements to produce enough outputs, and is overlapped with matrix loads, and completes in the middle of the output cycle.
  • Streamed Processing: During the 8-cycle output phase, the chip is available to take in 8 new input bytes provided at the ui_in ports. This is a streamlined flow of execution, as the input and output ports will henceforth be constantly used. After this 8-cycle output phase, the input bytes input during that phase can now begin outputting, while subsequent inputs can be further written.

    • However, if the user chooses not to write new inputs during the output phase, the outputs continue unabated, and the systolic array matrix accumulators automatically reset once the outputs are complete.
Critical Timing Relationships

The control unit implements sophisticated timing logic based on the systolic array's computational pipeline:

  • Cycle 0: Initial data feeding begins (a00×b00 starts)
  • Cycle 1: First partial products computed, additional data fed
  • Cycle 2: First result (c00) becomes available, Next input group can begin.
    • Future value of A00 expected at input during streamed processing
  • Cycle 3: Second and third results (c01, c10) become available simultaneously
    • Future value of A01 expected at input
  • Cycle 4: Final result (c11) becomes available
    • Future value of A10 expected at input
  • Cycle 5: All outputs remain stable.
    • Future value of A11 expected at input
  • Cycle 6: Output continues
    • Future value of B00 expected at input
  • Cycle 7: Output continues
    • Future value of B01 expected at input
  • Back to Cycle 0: Output continues. Since 6 of 8 of the next input elements would be on chip, the next cycle of data feeding begins
    • Future value of B10 expected at input
  • Cycle 1 again: Last output element, first partial products of next output are computed.
    • Future value of B11 expected at input
  • Cycle 2 again: Repeat of description above, etc.
State Transition Logic

State transitions are triggered by specific conditions:

  • S_IDLE → S_LOAD_MATS: When load_en is asserted (control_unit.v:30-32)
  • S_LOAD_MATS → S_MMU_FEED_COMPUTE_WB: When all 8 elements are loaded (mem_addr == 3'b111) (control_unit.v:37-38)
  • S_LOAD_MATS → S_MMU_FEED_COMPUTE_WB: When all 8 elements are loaded (mat_elems_loaded == 3'b111) (control_unit.v:37-38)
  • Afterwards, the state machine stays in S_MMU_FEED_COMPUTE_WB, but essentially cycles through counts of mem_addr and mmu_cycle to keep track of the memory address writing for streamed processing and maintain a rhythm for the Matrix Unit Feeder.
Integration with Other Modules

The control unit interfaces with all major TPU components:

  • Memory Module: Provides addressing (mem_addr) and coordinates write operations during loading
  • MMU Feeder: Supplies enable signal (mmu_en) and cycle timing (mmu_cycle) for data routing and output selection
  • Top-level TPU: Receives external load_en control signal and coordinates the entire operation sequence

This design ensures that matrix multiplication operations proceed automatically once initiated, with the control unit handling all timing dependencies and data flow coordination between the TPU's constituent modules.

The Matrix Unit Feeder

The Matrix Unit Feeder (in mmu_feeder.v) is the interface between the control unit and the computational unit (MMU), facilitating smooth data flow between the internal components of the TPU and outputs to the host. When enabled, its role is to either feed the expected matrix data from host to MMU, or to direct computed matrix outputs from MMU to host; this is decided based on the mmu_cycle defined and cycled through (0-7 repeating constantly) by the control unit.

Signal Name Direction Width Description
clk input 1 System clock
rst input 1 Active-high reset
en input 1 Enable signal for MMU operations
mmu_cycle input 3 Current cycle count for timing
weight[0,1,2,3] input 8 Weight matrix from memory
input[0,1,2,3] input 8 Input matrix from memory
c[00,01,10,11] input 8 Computed element output from MMU
done output 1 Signal to host that output is ready
host_outdata output 8 Data register to output to host
a_data[0,1] output 8 Output A to MMU for computation
b_data[0,1] output 8 Output B to MMU for computation

The weight and input matrices are taken from memory. The feeder will set the expected values of a_data0/1 and b_data0/1 depending on the value of mmu_cycle. Output values are 16 bits, but the maximum data width of the output ports is 8-bit so we have to feed half an element a cycle!

  • Cycle 0:
    • Blank data is sent to the output port. a_data0 = weight[0], b_data0 = input[0], done = 0
  • Cycle 1:
    • During the initial cycle when the memory is first populated, cycle 1 occurs during the input of the second matrix, which also overlaps with compute as the counter has already incremented.
    • This module will send a value to its output that is equal to the lower 8 bits of the product of the A00 and B00, which are the top-left elements of the matrices. This is due to the settings that enable streamed processing. The chip should therefore ignore that output.
    • In terms of feeding the systolic array, if fused transpose is disabled, then a_data0 = weight[1], a_data1 = weight[2], b_data0 = input[2], b_data1 = input[1].
    • If fused transpose is enabled, the b_data0 and b_data1 values are swapped.
  • Cycle 2:
    • In this cycle, the output counter is 0. The output sends the upper 8 bits of C00, the top-left output element.
    • To signal that the outputs are coming, the done signal is asserted, which is visible from the user.
    • The values given to the systolic array are a_data1 = weight[3], b_data1 = input[3].
  • Cycle 3:
    • In this cycle, the output counter is 1. The output sends the lower 8 bits of C00, the top-left output element.
    • Since we have finished sending all the last input values to the systolic array, intermediate feeds will have a value of 0, having no effect on the operation of MAC units.
  • Cycle 4:
    • In this cycle, the output counter is 2. The output sends the upper 8 bits of C01, the top-right output element.
  • Cycle 5:
    • In this cycle, the output counter is 3. The output sends the lower 8 bits of C01, the top-right output element.
  • Cycle 6:
    • In this cycle, the output counter is 4. The output sends the upper 8 bits of C10, the bottom-left output element.
  • Cycle 7:
    • In this cycle, the output counter is 5. The output sends the lower 8 bits of C10, the bottom-left output element.
  • Back to Cycle 0: (set by control unit)
    • In this cycle, the output counter is 6. The output sends the upper 8 bits of C11, the bottom-right output element.
    • The clear signal is asserted to make way for the computation of the next input to accumulate from 0.
    • Therefore, to preserve the lower 8 bits of C11 output in the next cycle, we assign the value to a "pipeline register", aptly named tail_hold as it holds the tail of the output.
    • The systolic array feeding pattern is the exact same as was shown above.
  • Cycle 1 Again:
    • In this cycle, the output counter is 7. The output sends the lower 8 bits of C11, the bottom-right output element.
    • The output count resets to 0 automatically.

For similar details on timing relationships, see Critical Timing Relationships above, in the Control Unit section.

How to test

Notation: the matrix element A_xy denotes a value in the xth row and yth column of the matrix A.

The module will assume an order of input of A matrix values and B matrix values, and outputs. That is, it is expected that inputs come in order of A00, A01, A10, A11, B00, B01, B10, B11, and the outputs will come in the order of C00, C01, C10, C11. This keeps the chip simple and avoids extra logic/user input.

Setup

  1. Power Supply: Connect the chip to a stable power supply as per the voltage specifications.
  2. Clock Signal: Provide a stable clock signal to the clk pin.
  3. Reset: Ensure the rst_n pin is properly connected to allow resetting the chip.

A Matrix Multiplication Round

  1. Initial Reset
    • Perform a reset by pulling the rst_n pin low to 0, and waiting for a single clock signal before pulling it back high to 1. This sets initial state values.
  2. Initial Matrix Load
    • Load 8 matrix elements into the chip, one per cycle. For example, if your matrices are [[1, 2], [3, 4]], [[5, 6], [7, 8]], you would load in the row-major-first-matrix-first order of 1, 2, 3, 4, 5, 6, 7, 8. This occurs by setting the 8 ui_in pins to the 8-bit value of the set matrix element, and waiting one clock cycle before the next can be loaded.
  3. Collect Output & Send Next Inputs
    • Thanks to the aggressive pipelining implemented in the chip, once the matrices are loaded, you can already start collecting output!
    • Output elements will be 16 bits each, but since the output port is only 8 bits, one element is output in 2 cycles, with the upper half (bits 15 - 8) in the first cycle, and the lower half (bits 7 - 0) in the second cycle.
    • To collect outputs, wait for a single clock edge, and then read the uo_out pin for the 8-bit value. Repeat again to get the full 16-bit value. Overall, the matrix output at uo_out will be in the order of c_00, c_01, c_10, c_11, taking 8 cycles to output 4 elements.
    • For the above example, the output would be in the order of [19, 22, 43, 50], starting from the cycle right after you finish your last load, and ending 8 cycles afterwards.
    • It is also recommended that in those same 8 cycles, the next 2 input matrices are sent to the ui_in pin. That will be 1 element per cycle for 2 serial, row-major 2x2 matrix inputs, for 8 cycles total.
  4. Repeat
  5. Input Options
    • Note that if new matrices are not input during the output cycle, i.e. the ui_in pin is set to 0, then it is the equivalent of "flushing the pipeline", as once the output is complete, it is the equivalent of starting at step 2.

Below is a visual of an example matrix multiplication round through the systolic array. Note that while it behaves similarly to the chip, the chip's matrix inputs to the systolic array are the diagram's order inverted across the matrix diagonal.

Alt text

Matrix Multiplication Options

The example shown above is a very simple and plain 2x2 matrix multiplication. However, this TPU chip offers additional options.

The first is the ability to compute the product $AB^T$, which is the first matrix multiplied by the transpose of the second. This saves time computing a transpose taken by a CPU instruction in $O(n^2)$ time, where n is a rough measure of the matrix dimension. Instead, it is fused with the entire process, taking no extra time.

The second is the ability to run the Rectified Linear Unit (ReLU) activation function, commonly seen in neural networks for approximating non-linear patterns in data.

The third, which is provided as a software interface option, is the ability to multiply bigger matrices, of all compatible dimensions, in 2x2 blocks.

Example Result

Below is a timing diagram showing signal progression for a streamed processing pattern of the TPU chip, generated by GTKWave:

Alt text

The part highlighting Input/Output streaming is the fact that mem_addr, which increments whenever load is enabled, keeps incrementing like 0-7, 0-7, etc, and so is the output count, albeit slightly offset in time.

One can also observe the pattern in which elements are fed into the systolic array, as seen by a_data0, a_data1, b_data0, b_data1 signals, and the output "waterfall" flow of output appearances seen inside c00, c01, c10, c11.

Scaling it up

Earlier, it was mentioned that you could scale the multiplication up to any dimension. What else does this mean? AI inference! We are able to run forward inference of a Quantization-Aware-Trained (QAT) machine learning model using the chip's logic.

The model is trained to recognize black-and-white images of single handwritten digits from the MNIST dataset.

In the demonstration, which is kept simple, I run QAT on the local PC, and then run forward inference with this model on the chip. It was able to successfully recognize 2 out of 3 images in a test batch, which is far superior to a coin flip.

External hardware

An external microcontroller will send signals over the chip interface, including the clock signal, which will allow it to coordinate I/O on clock edges.

Acknowledgements

  • William Zhang: Processing Elements, Systolic Array, Module Compilation & Integration, Pipelining Optimization
  • Ethan Leung: Matrix Unit Feeder
  • Guhan Iyer: Unified Memory
  • Yash Karthik: Control Unit
  • ECE 298A Course Staff: Prof. John Long, Prof. Vincent Gaudet, Refik Yalcin

An earlier iteration of this project is located at this repository, in which the original plan was to submit to the IHP25B shuttle.

IO

#InputOutputBidirectional
0IN0OUT0LOAD_EN (input)
1IN1OUT1TRANSPOSE (input)
2IN2OUT2ACTIVATION (input)
3IN3OUT3Unused
4IN4OUT4Unused
5IN5OUT5Unused
6IN6OUT6Unused
7IN7OUT7DONE (output)

Chip location

Controller Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux tt_um_chip_rom (Chip ROM) tt_um_factory_test (Tiny Tapeout Factory Test) tt_um_oscillating_bones (Oscillating Bones) tt_um_rebelmike_incrementer (Incrementer) tt_um_rebeccargb_tt09ball_gdsart (TT09Ball GDS Art) tt_um_tt_tinyQV (TinyQV 'Asteroids' - Crowdsourced Risc-V SoC) tt_um_DalinEM_asic_1 (ASIC) tt_um_urish_simon (Simon Says memory game) tt_um_rburt16_bias_generator (Bias Generator) tt_um_librelane3_test (Tiny Tapeout LibreLane 3 Test) tt_um_10_vga_crossyroad (Crossyroad) tt_um_rebeccargb_universal_decoder (Universal Binary to Segment Decoder) tt_um_rebeccargb_hardware_utf8 (Hardware UTF Encoder/Decoder) tt_um_rebeccargb_intercal_alu (INTERCAL ALU) tt_um_rebeccargb_dipped (Densely Packed Decimal) tt_um_rebeccargb_styler (Styler) tt_um_rebeccargb_vga_timing_experiments (VGA Timing Experiments) tt_um_rebeccargb_colorbars (Color Bars) tt_um_rebeccargb_vga_pride (VGA Pride) tt_um_cw_vref (Current-Mode Bandgap Reference) tt_um_tinytapeout_logo_screensaver (VGA Screensaver with Tiny Tapeout Logo) tt_um_rburt16_opamp_3stage (OpAmp 3stage) tt_um_gamepad_pmod_demo (Gamepad Pmod Demo) tt_um_micro_tiles_container (Micro tile container) tt_um_virantha_enigma (Enigma - 52-bit Key Length) tt_um_jamesrosssharp_1bitam (1bit_am_sdr) tt_um_jamesrosssharp_tiny1bitam (Tiny 1-bit AM Radio) tt_um_MichaelBell_rle_vga (RLE Video Player) tt_um_MichaelBell_mandelbrot (VGA Mandelbrot) tt_um_murmann_group (Decimation Filter for Incremental and Regular Delta-Sigma Modulators) tt_um_betz_morse_keyer (Morse Code Keyer) tt_um_urish_giant_ringosc (Giant Ring Oscillator (3853 inverters)) tt_um_tiny_pll (Tiny PLL) tt_um_tc503_countdown_timer (Countdown Timer) tt_um_richardgonzalez_ped_traff_light (Pedestrian Traffic Light) tt_um_analog_factory_test (TT08 Analog Factory Test) tt_um_alexandercoabad_mixedsignal (mixedsignal) tt_um_tgrillz_sixSidedDie (Six Sided Die) tt_um_mattvenn_analog_ring_osc (Ring Oscillators) tt_um_vga_clock (VGA clock) tt_um_mattvenn_r2r_dac_3v3 (Analog 8 bit 3.3v R2R DAC) tt_um_mattvenn_spi_test (SPI test) tt_um_quarren42_demoscene_top (asic design is my passion) tt_um_micro_tiles_container_group2 (Micro tile container (group 2)) tt_um_z2a_rgb_mixer (RGB Mixer demo) tt_um_frequency_counter (Frequency counter) tt_um_urish_sic1 (SIC-1 8-bit SUBLEQ Single Instruction Computer) tt_um_tobi_mckellar_top (Capacitive Touch Sensor) tt_um_log_afpm (16-bit Logarithmic Approximate Floating Point Multiplier) tt_um_uwasic_dinogame (UW ASIC - Optimized Dino) tt_um_ece298a_8_bit_cpu_top (8-Bit CPU) tt_um_tqv_peripheral_harness (Rotary Encoder Peripheral) tt_um_led_matrix_driver (SPI LED Matrix Driver) tt_um_2048_vga_game (2048 sliding tile puzzle game (VGA)) tt_um_mac (MAC) tt_um_dpmunit (DPM_Unit) tt_um_nitelich_riscyjr (RISCY Jr.) tt_um_nitelich_conway (Conway's GoL) tt_um_pwen (Pulse Width Encoder) tt_um_mcs4_cpu (MCS-4 4004 CPU) tt_um_mbist (Design of SRAM BIST) tt_um_weighted_majority (Weighted Majority Voter / Trend Detector) tt_um_brandonramos_VGA_Pong_with_NES_Controllers (VGA Pong with NES Controllers) tt_um_brandonramos_opamp_ladder (2-bit Flash ADC) tt_um_NE567Mixer28 (OTA folded cascode) tt_um_acidonitroso_programmable_threshold_voltage_sensor (Programmable threshold voltage sensor) tt_um_DAC1 (tt_um_DAC1) tt_um_trivium_stream_processor (Trivium Stream Cipher) tt_um_analog_example (Digital OTA) tt_um_sortaALUAriaMitra (Sorta 4-Bit ALU) tt_um_RoyTr16 (Connect Four VGA) tt_um_jnw_wulffern (JNW-TEMP) tt_um_serdes (Secure SERDES with Integrated FIR Filtering) tt_um_limpix31_r0 (VGA Human Reaction Meter) tt_um_torurstrom_async_lock (Asynchronous Locking Unit) tt_um_galaguna_PostSys (Post's Machine CPU Based) tt_um_edwintorok (Rounding error) tt_um_td4 (tt-td04) tt_um_snn (Reward implemented Spiking Neural Network) tt_um_matrag_chirp_top (Tiny Tapeout Chirp Modulator) tt_um_sha256_processor_dvirdc (SHA-256 Processor) tt_um_pchri03_levenshtein (Fuzzy Search Engine) tt_um_AriaMitraClock (12 Hour Clock (with AM and PM)) tt_um_swangust (posit8_add) tt_um_DelosReyesJordan_HDL (Reaction Time Test) tt_um_upalermo_simple_analog_circuit (Simple Analog Circuit) tt_um_swangust2 (posit8_mul) tt_um_thexeno_rgbw_controller (RGBW Color Processor) tt_um_top_layer (Spike Detection and Classification System) tt_um_Alida_DutyCycleMeter (Duty Cycle Meter) tt_um_dco (Digitally Controlled Oscillator) tt_um_8bitalu (8-bit Pipelined ALU) tt_um_resfuzzy (resfuzzy) tt_um_javibajocero_top (MarcoPolo) tt_um_Scimia_oscillator_tester (Oscillator tester) tt_um_ag_priority_encoder_parity_checker (Priority Encoder with Parity Checker) tt_um_tnt_mosbius (tnt's variant of SKY130 mini-MOSbius) tt_um_program_counter_top_level (Test Design 1) tt_um_subdiduntil2_mixed_signal_classifier (Mixed-signal Classifier) tt_um_dac_test3v3 (Analog 8 bit 3.3v R2R DAC) tt_um_LPCAS_TP1 ( LPCAS_TP1 ) tt_um_regfield (Register Field) tt_um_delaychain (Delay Chain) tt_um_tdctest_container (Micro tile container) tt_um_spacewar (Spacewar) tt_um_Enhanced_pll (Enhance PLL) tt_um_romless_cordic_engine (ROM-less Cordic Engine) tt_um_ev_motor_control (PLC Based Electric Vehicle Motor Control System) tt_um_plc_prg (PLC-PRG) tt_um_kishorenetheti_tt16_mips (8-bit MIPS Single Cycle Processor) tt_um_snn_core (Adaptive Leaky Integrate-and-Fire spiking neuron core for edge AI) tt_um_myprocessor (8-bit Custom Processor) tt_um_sjsu (SJSU vga demo) tt_um_vedic_4x4 (Vedic 4x4 Multiplier) tt_um_braun_mult (8x8 Braun Array Multiplier) tt_um_r2r_dac (4-bit R2R DAC) tt_um_stochastic_integrator_tt9_CL123abc (Stochastic Integrator) tt_um_uart (UART Controller with FIFO and Interrupts) tt_um_lfsr_stevej (Linear Feedback Shift Register) tt_um_FFT_engine (FFT Engine) tt_um_tpu (Tiny Tapeout Tensor Processing Unit) tt_um_tt_tinyQVb (TinyQV 'Berzerk' - Crowdsourced Risc-V SoC) tt_um_IZ_RG_22 (IZ_RG_22) tt_um_32_bit_fp_ALU_S_M (32-bit floating point ALU) tt_um_AriaMitraGames (Games (Tic Tac Toe and Rock Paper Scissors)) tt_um_sc_bipolar_qif_neuron (Stochastic Computing based QIF model neuron) tt_um_mac_spst_tiny (Low Power and Enhanced Speed Multiplier, Accumulator with SPST Adder) tt_um_kb2ghz_xalu (4-bit minicomputer ALU) tt_um_emmersonv_tiq_adc (3 Bit TIQ ADC) tt_um_simonsays (Simon Says) tt_um_BNN (8-bit Binary Neural Network) tt_um_anweiteck_2stageCMOSOpAmp (2 Stage CMOS Op Amp) tt_um_6502 (Simplified 6502 Processor) tt_um_swangust3 (posit8_div) tt_um_jonathan_thing_vga (VGA-Video-Player) tt_um_wokwi_412635532198550529 (ttsky-pettit-wokproc-trainer) tt_um_vga_hello_world (VGA HELLO WORLD) tt_um_jyblue1001_pll (Analog PLL) tt_um_BryanKuang_mac_peripheral (8-bit Multiply-Accumulate (MAC) with 2-Cycle Serial Interface) tt_um_rebeccargb_tt09ball_screensaver (TT09Ball VGA Screensaver) tt_um_openfpga22 (Open FGPA 2x2 design) tt_um_andyshor_demux (Demux) tt_um_flash_raid_controller (SPI flash raid controller) tt_um_jonnor_pdm_microphone (PDM microphone) tt_um_digital_playground (Sky130 Digital Playground) tt_um_mod6_counter (Mod-6 Counter) tt_um_BMSCE_T2 (Choreo8) tt_um_Richard28277 (4-bit ALU) tt_um_shuangyu_top (Calculator) tt_um_wokwi_441382314812372993 (Sumador/restador de 4 bits) tt_um_TensorFlowE (TensorFlowE) tt_um_wokwi_441378095886546945 (7SDSC) tt_um_wokwi_440004235377529857 (Latched 4-bits adder) tt_um_dlmiles_tqvph_i2c (TinyQV I2C Controller Device) tt_um_markgarnold_pdp8 (Serial PDP8) tt_um_wokwi_441564414591667201 (tt-parity-detector) tt_um_vga_glyph_mode (VGA Glyph Mode) tt_um_toivoh_pwl_synth (PiecewiseOrionSynth Deluxe) tt_um_minirisc (MiniRISC-FSM) tt_um_wokwi_438920793944579073 (Multiple digital design structures) tt_um_sleepwell (Sleep Well) tt_um_lcd_controller_Andres078 (LCD_controller) tt_um_SummerTT_HDL (SJSU Summer Project: Game of Life) tt_um_chrishtet_LIF (Leaky Integrate and Fire Neuron) tt_um_diff (ttsky25_EpitaXC) tt_um_htfab_split_flops (Split Flops) tt_um_alu_4bit_wrapper (4-bit ALU with Flags) tt_um_tnt_rf_test (TTSKY25A Register File Test) tt_um_mosbius (mini mosbius) tt_um_robot_controller_top_module (AR Chip) tt_um_flummer_ltc (Linear Timecode (LTC) generator) tt_um_stress_sensor (Tiny_Tapeout_2025_three_sensors) tt_um_krisjdev_manchester_baby (Manchester Baby) tt_um_mbikovitsky_audio_player (Simple audio player) tt_um_wokwi_414123795172381697 (TinySnake) tt_um_vga_example (Jabulani Ball VGA Demo ) tt_um_stochastic_addmultiply_CL123abc (Stochastic Multiplier, Adder and Self-Multiplier) tt_um_nvious_graphics (nVious Graphics) tt_um_pe_simonbju (pe) tt_um_mikael (TinyTestOut) tt_um_brent_kung (brent-kung_4) tt_um_7FM_ShadyPong (ShadyPong) tt_um_algofoogle_vga_matrix_dac (Analog VGA CSDAC experiments) tt_um_tv_b_gone (TV-B-Gone) tt_um_sjsu_vga_music (SJSU Fight Song) tt_um_fsm_haz (FSM based RISC-V Pipeline Hazard Resolver) tt_um_dma (DMA controller) tt_um_3v_inverter_SiliconeGuide (Analog Double Inverter) tt_um_rejunity_lgn_mnist (LGN hand-written digit classifier (MNIST, 16x16 pixels)) tt_um_gray_sobel (Gray scale and Sobel filter for Edge Detection) tt_um_Xgamer1999_LIF (Demonstration of Leaky integrate and Fire neuron SJSU) tt_um_dac12 (12 bit DAC) tt_um_voting_machine (Digital Voting Machine) tt_um_updown_counter (8bit_up-down_counter) tt_um_openram_top (Single Port OpenRAM Testchip) tt_um_customalu (Custom ALU) tt_um_assaify_mssf_pll (24 MHz MSSF PLL) tt_um_Maj_opamp (2-Stage OpAmp Design) tt_um_wokwi_442131619043064833 (Encoder 7 segments display) tt_um_wokwi_441835796137492481 (TESVG Binary Counter and shif register ) tt_um_combo_haz (Combinational Logic Based RISC-V Pipeline Hazard Resolver) tt_um_tx_fsm (Design and Functional Verification of Error-Correcting FIFO Buffer with SECDED and ARQ ) tt_um_will_keen_solitaire (solitaire) tt_um_rom_vga_screensaver (VGA Screensaver with embedded bitmap ROM) tt_um_13hihi31_tdc (Time to Digital Converter) tt_um_dteal_awg (Arbitrary Waveform Generator) tt_um_LIF_neuron (AFM_LIF) tt_um_rebelmike_register (Circulating register test) tt_um_MichaelBell_hs_mul (8b10b decoder and multiplier) tt_um_SNPU (random_latch) tt_um_rejunity_atari2600 (Atari 2600) tt_um_bit_serial_cpu_top (16-bit bit-serial CPU) tt_um_semaforo (semaforo) tt_um_bleeptrack_cc1 (Cross stitch Creatures #1) tt_um_bleeptrack_cc2 (Cross stitch Creatures #2) tt_um_bleeptrack_cc3 (Cross stitch Creatures #3) tt_um_bleeptrack_cc4 (Cross stitch Creatures #4) tt_um_bitty (Bitty) tt_um_spi2ws2811x16 (spi2ws2811x8) tt_um_uart_spi (UART and SPI Communication blocks with loopback) tt_um_urish_charge_pump (Dickson Charge Pump) tt_um_adc_dac_tern_alu (adc_dac_BCT_addr_ALU_STI) tt_um_sky1 (GD Sky Processor) tt_um_fifo (ASYNCHRONOUS FIFO) tt_um_TT16 (Asynchronous FIFO) tt_um_axi4lite_top (Axi4_Lite) tt_um_TT06_pwm (PWM Generator) tt_um_hack_cpu (HACK CPU) tt_um_marxkar_jtag (JTAG CONTROLLER) tt_um_cache_controller (Simple Cache Controller) tt_um_stopwatchtop (Stopwatch with 7-seg Display) tt_um_adpll (all-digital pll) tt_um_tnt_rom_test (TT09 SKY130 ROM Test) tt_um_tnt_rom_nolvt_test (TT09 SKY130 ROM Test (no LVT variant)) tt_um_wokwi_414120207283716097 (fulladder) tt_um_kianV_rv32ima_uLinux_SoC (KianV uLinux SoC) tt_um_tv_b_gone_rom (TV-B-Gone-EU) Available Available Available Available Available Available Available Available Available Available Available Available Available