195 Tensor Processing Unit(TPU)

Design render

How it works

This project is a small-scale matrix multiplier inspired by the Tensor Processing Unit (TPU), an AI inference accelerator ASIC developed by Google.

It multiplies two 2x2 matrices with signed 8-bit (1-byte) elements into an output matrix with signed 16-bit (2-byte) elements. It does so in a systolic array circuit, where data flows through the connections between a grid of 4 Multiply-Add-Accumulate (MAC) Processing Elements (PEs).

To store inputs prior to computation, it holds the 2 input matrices in memory registers, which occupy a total of 8 bytes.

To orchestrate the flow of data between inputs, memory, and outputs, a control unit coordinates state transitions, loads, and stores automatically.

Finally, a feeder module interfaces with the matrix multiplier to schedule the inputs and outputs to and from the systolic array.

It can sustain over 99.8 million operations per second when using the maximum-throughput streamed processing pattern to multiply large matrices in 2x2 blocks.

System Architecture

[Figure: system architecture diagram]

The Processing Element

Let's start from the most atomic element of the matrix multiplier unit (MMU): its processing element (PE). The value stored within each PE contributes an element to the output.

Signal Name | Direction | Width | Description
clk         | input     | 1     | The clock!
rst         | input     | 1     | Reset
clear       | input     | 1     | Clear PE
a_in        | input     | 8     | Input value
a_out       | output    | 8     | Pass-on of input
b_in        | input     | 8     | Weight value
b_out       | output    | 8     | Pass-on of weight
c_out       | output    | 16    | Accumulation

Since each output element of a matrix multiplication is a sum of products, the PE's primary operation is a multiply-add-accumulate.

It takes the input terms a_in and b_in, multiplies them, and adds the product to the accumulator value c_out. Because multiplication produces larger values than its operands, the accumulator holds more bits.

Since adjacent PEs corresponding to adjacent elements of the output matrix need the same input and weight values, these input terms are sent to a_out and b_out respectively, which are connected to other PEs by the systolic array.

Once the multiplication is done, the control unit clears the PEs so that they restart the accumulation for the next matrix product. This is done through the clear signal, which is asserted automatically.

Resetting the entire chip instead would be wasteful: it costs time (an entire clock cycle) and would needlessly reset other elements such as memory.

The output is 16 bits wide for 8-bit inputs because the product of two signed 8-bit values can need up to 16 bits.
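The PE behaviour described above can be sketched as a small behavioural model. This is a Python illustration, not the project's Verilog: the class name, the `clock` method, and the `to_int16` helper are hypothetical, and both the wraparound on overflow and the exact effect of `clear` are assumptions. Note that a single product of two signed 8-bit values always fits in 16 bits, but the sum of two products can reach 32768 (when all four operands are -128), which is one more than the largest signed 16-bit value.

```python
def to_int16(value):
    """Wrap an integer to signed 16-bit two's complement (assumed overflow behaviour)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class ProcessingElement:
    """Behavioural model of one multiply-accumulate PE (hypothetical, not the RTL)."""

    def __init__(self):
        self.a_out = 0  # registered pass-on of the input value
        self.b_out = 0  # registered pass-on of the weight value
        self.c_out = 0  # 16-bit accumulator

    def clock(self, a_in, b_in, clear=False):
        """Model one rising clock edge."""
        if clear:
            # Assumption: clear zeroes the PE state; the real RTL may differ.
            self.a_out = self.b_out = self.c_out = 0
        else:
            self.c_out = to_int16(self.c_out + a_in * b_in)
            self.a_out, self.b_out = a_in, b_in


pe = ProcessingElement()
pe.clock(1, 5)   # accumulate 1 * 5
pe.clock(2, 7)   # accumulate 2 * 7
print(pe.c_out)  # 19
pe.clock(0, 0, clear=True)
print(pe.c_out)  # 0
```

The accumulated value 19 is the top-left element of [[1, 2], [3, 4]] times [[5, 6], [7, 8]], the example used in the test instructions.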

The Systolic Array

Signal Name | Direction | Width | Description
clk         | input     | 1     | The clock!
rst         | input     | 1     | Reset
clear       | input     | 1     | Forwarded to PEs
activation  | input     | 1     | Enables ReLU
a_data0     | input     | 8     | Input value of top-left PE
a_data1     | input     | 8     | Input value of bottom-left PE
b_data0     | input     | 8     | Weight value of top-left PE
b_data1     | input     | 8     | Weight value of top-right PE
c00         | output    | 16    | Top-left output value
c01         | output    | 16    | Top-right output value
c10         | output    | 16    | Bottom-left output value
c11         | output    | 16    | Bottom-right output value

The systolic array is a network, or grid, of PEs. In this 2x2 multiplier, the result is a 4-element square matrix, so there are 4 PEs.

Internally, the systolic array maintains registers and a matrix of accumulators that the PEs read from and write to.

These comprise a 2x3 matrix for the input matrix values, a 3x2 matrix for the weight matrix values, and a 2x2 matrix for the final output.

The extra row and column beyond 2x2 for the input and weight values give the PEs at the edge of the grid somewhere to send their pass-on values: an unused "garbage" register.

At each clock cycle, elements will flow between the PEs. The inputs will flow from "left to right", and the weights will flow from "top to bottom". To add new values, inputs have to be provided to the ports at the "left", and weights have to be provided to the ports at the "top".

The PEs are then instantiated in a Verilog generate loop indexed by a genvar, a compile-time construct, in which the signals of each PE are connected to specific indices of the internal systolic array signals. This keeps the code clean and easy to write!

Diagram of PE arrangement below:

[Figure: arrangement of PEs in the systolic array]

Unified Memory

The unified memory module (memory.v) is an on-chip store that holds both weight and input matrices for quick access to both values during computations.

Signal Name     | Direction | Width | Description
clk             | input     | 1     | System clock
rst             | input     | 1     | Active-high reset
load_en         | input     | 1     | Enable signal for matrix loading
addr            | input     | 3     | Memory address for matrix elements
in_data         | input     | 8     | Current matrix element to load
weight[0,1,2,3] | output    | 8     | Weight matrix elements
input[0,1,2,3]  | output    | 8     | Input matrix elements

The memory is written only from the ui_in ports. When load_en is high, the current element (in_data), driven by all 8 bits of the dedicated input ui_in, is loaded into memory at the specified byte address on the next rising clock edge.

Matrix elements are held internally in an 8-entry array of 8-bit registers (sram). On the output side, sram[0..3] maps to weight[0..3] and sram[4..7] maps to input[0..3]. The control unit generates the address (addr) so that each element is loaded into the correct register.

The entire 8-byte memory space is visible at the output ports asynchronously.
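The address mapping above can be illustrated with a short behavioural model. This is a sketch in Python rather than the contents of memory.v; the class and method names are hypothetical, and reset behaviour is left out.

```python
class UnifiedMemory:
    """Behavioural model of the 8-byte unified memory (hypothetical sketch)."""

    def __init__(self):
        self.sram = [0] * 8  # eight 8-bit registers

    def clock(self, load_en, addr, in_data):
        """Model one rising clock edge: write only while load_en is high."""
        if load_en:
            self.sram[addr & 0x7] = in_data & 0xFF

    @property
    def weight(self):
        # sram[0..3] maps to weight[0..3]
        return self.sram[0:4]

    @property
    def input(self):
        # sram[4..7] maps to input[0..3]
        return self.sram[4:8]


mem = UnifiedMemory()
for addr, value in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    mem.clock(load_en=1, addr=addr, in_data=value)
print(mem.weight)  # [1, 2, 3, 4]
print(mem.input)   # [5, 6, 7, 8]
```

Because the outputs are plain views of the register array, every stored byte is readable as soon as it has been written, which mirrors the asynchronous read described above.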

Control Unit

The control unit (control_unit.v) serves as the central orchestrator for the entire TPU, coordinating the flow of data between memory, the systolic array, and output collection through a carefully designed finite state machine (FSM).

Signal Name | Direction | Width | Description
clk         | input     | 1     | System clock
rst         | input     | 1     | Active-high reset
load_en     | input     | 1     | Enable signal for matrix loading
mem_addr    | output    | 3     | Memory address for matrix elements
mmu_en      | output    | 1     | Enable signal for MMU operations
mmu_cycle   | output    | 3     | Current cycle count for MMU timing

State Machine Architecture

The control unit implements a 3-state FSM that manages the complete matrix multiplication pipeline:

  1. S_IDLE (2'b00): The default waiting state where the system remains until a matrix multiplication operation is requested via the load_en signal.

  2. S_LOAD_MATS (2'b01): The matrix loading phase, in which 8 matrix elements (4 for each 2x2 matrix) are sequentially loaded into memory. Because the inputs are assumed to arrive in row-major order, left matrix first, the address tracked in mem_addr simply increments whenever load_en is asserted, counting from 0 to 7 before wrapping back to 0 for future input matrices.

  3. S_MMU_FEED_COMPUTE_WB (2'b10): The computation and output phase, which takes 8 cycles in total. The systolic array performs the last few operations of the matrix multiplication, and the 4 outputs, 16 bits each, are made available at a rate of 1 every 2 cycles. At the same time, the chip accepts streamed input: 8 new elements, representing 2 new 2x2 matrices, can be loaded so that the next round of outputs follows immediately after the current one.
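The three states and the conditions for moving between them can be sketched as a next-state function. This is a Python illustration under stated assumptions, not the code in control_unit.v: the function names are hypothetical, the address counter is assumed to advance on every cycle where load_en is high, and the mmu_en and mmu_cycle outputs are not modelled.

```python
S_IDLE, S_LOAD_MATS, S_MMU_FEED_COMPUTE_WB = 0b00, 0b01, 0b10


def next_state(state, load_en, mem_addr):
    """Return the FSM state after one rising clock edge."""
    if state == S_IDLE:
        return S_LOAD_MATS if load_en else S_IDLE
    if state == S_LOAD_MATS:
        # Leave the load state once the last address (7) has been reached.
        return S_MMU_FEED_COMPUTE_WB if mem_addr == 0b111 else S_LOAD_MATS
    return S_MMU_FEED_COMPUTE_WB  # stays here during streamed processing


def run(load_en_trace):
    """Step the FSM over a list of load_en values; return (state, mem_addr)."""
    state, mem_addr = S_IDLE, 0
    for load_en in load_en_trace:
        state = next_state(state, load_en, mem_addr)
        if load_en:
            # Assumption: the address advances on every load and wraps at 8.
            mem_addr = (mem_addr + 1) % 8
    return state, mem_addr


print(run([1] * 8))  # (2, 0)
```

After eight loads the model sits in S_MMU_FEED_COMPUTE_WB with the address wrapped back to 0, ready for the next pair of matrices.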

Orchestration Logic

The control unit coordinates several critical functions:

  • Memory Interface Management: Through the mem_addr output (control_unit.v:9), it generates sequential memory addresses (0-7) during the loading and streaming phase, ensuring matrix elements are stored in the correct memory locations for later retrieval.

  • MMU Timing Control: The mmu_cycle signal (control_unit.v:13) provides precise timing information to the MMU feeder module, enabling it to:

    • Feed correct matrix elements to the systolic array at appropriate cycles

    • Determine when computation results are ready for output

    • Clear processing elements after computation completion

  • Pipeline Coordination: The mmu_en signal (control_unit.v:12) acts as the master enable for the entire computation pipeline, going from low during loading to high during computation. This ensures that, in the first round, elements are fed into the systolic array only once the inputs are in memory; a chip that has not been initialized with its inputs cannot complete a computation and so should not start one.

    • For maximum throughput, however, mmu_en is asserted as soon as 6 of the 8 elements making up the 2 matrices have been input. Computation then begins once enough elements are on chip to produce the first outputs, overlaps with the remaining matrix loads, and completes in the middle of the output phase.

  • Streamed Processing: During the 8-cycle output phase, the chip can take in 8 new input bytes at the ui_in ports. This keeps both the input and output ports continuously busy. After this 8-cycle output phase, the results for the bytes input during that phase begin outputting, while subsequent inputs can be written in turn.

    • However, if the user chooses not to write new inputs during the output phase, the outputs continue unabated, and the systolic array accumulators automatically reset once the outputs are complete.

Critical Timing Relationships

The control unit implements sophisticated timing logic based on the systolic array's computational pipeline:

  • Cycle 0: Initial data feeding begins (a00×b00 starts)
  • Cycle 1: First partial products computed, additional data fed
  • Cycle 2: First result (c00) becomes available; the next input group can begin.
    • Future value of A00 expected at input during streamed processing
  • Cycle 3: Second and third results (c01, c10) become available simultaneously
    • Future value of A01 expected at input
  • Cycle 4: Final result (c11) becomes available
    • Future value of A10 expected at input
  • Cycle 5: All outputs remain stable.
    • Future value of A11 expected at input
  • Cycle 6: Output continues
    • Future value of B00 expected at input
  • Cycle 7: Output continues
    • Future value of B01 expected at input
  • Back to Cycle 0: Output continues. Since 6 of 8 of the next input elements would be on chip, the next cycle of data feeding begins
    • Future value of B10 expected at input
  • Cycle 1 again: Last output element, first partial products of next output are computed.
    • Future value of B11 expected at input
  • Cycle 2 again: The sequence repeats from the Cycle 2 description above.

State Transition Logic

State transitions are triggered by specific conditions:

  • S_IDLE → S_LOAD_MATS: When load_en is asserted (control_unit.v:30-32)
  • S_LOAD_MATS → S_MMU_FEED_COMPUTE_WB: When all 8 elements are loaded (mem_addr == 3'b111) (control_unit.v:37-38)
  • Afterwards, the state machine stays in S_MMU_FEED_COMPUTE_WB and cycles through the mem_addr and mmu_cycle counts, which track the write address for streamed processing and set the rhythm for the Matrix Unit Feeder.

Integration with Other Modules

The control unit interfaces with all major TPU components:

  • Memory Module: Provides addressing (mem_addr) and coordinates write operations during loading
  • MMU Feeder: Supplies enable signal (mmu_en) and cycle timing (mmu_cycle) for data routing and output selection
  • Top-level TPU: Receives external load_en control signal and coordinates the entire operation sequence

This design ensures that matrix multiplication operations proceed automatically once initiated, with the control unit handling all timing dependencies and data flow coordination between the TPU's constituent modules.

The Matrix Unit Feeder

The Matrix Unit Feeder (in mmu_feeder.v) is the interface between the control unit and the computational unit (MMU), facilitating smooth data flow between the internal components of the TPU and outputs to the host. When enabled, its role is to either feed the expected matrix data from host to MMU, or to direct computed matrix outputs from MMU to host; this is decided based on the mmu_cycle defined and cycled through (0-7 repeating constantly) by the control unit.

Signal Name     | Direction | Width | Description
clk             | input     | 1     | System clock
rst             | input     | 1     | Active-high reset
en              | input     | 1     | Enable signal for MMU operations
mmu_cycle       | input     | 3     | Current cycle count for timing
weight[0,1,2,3] | input     | 8     | Weight matrix from memory
input[0,1,2,3]  | input     | 8     | Input matrix from memory
c[00,01,10,11]  | input     | 16    | Computed element output from MMU
done            | output    | 1     | Signal to host that output is ready
host_outdata    | output    | 8     | Data register to output to host
a_data[0,1]     | output    | 8     | Output A to MMU for computation
b_data[0,1]     | output    | 8     | Output B to MMU for computation

The weight and input matrices are taken from memory. The feeder will set the expected values of a_data0/1 and b_data0/1 depending on the value of mmu_cycle. Output values are 16 bits, but the maximum data width of the output ports is 8-bit so we have to feed half an element a cycle!

  • Cycle 0:
    • Blank data is sent to the output port. a_data0 = weight[0], b_data0 = input[0], done = 0
  • Cycle 1:
    • During the initial round, when the memory is first populated, cycle 1 occurs while the second matrix is still being input; this overlaps with compute because the counter has already incremented.
    • This module will send a value to its output that is equal to the lower 8 bits of the product of A00 and B00, the top-left elements of the matrices. This is a side effect of the settings that enable streamed processing, and the host should ignore that output.
    • In terms of feeding the systolic array, if fused transpose is disabled, then a_data0 = weight[1], a_data1 = weight[2], b_data0 = input[2], b_data1 = input[1].
    • If fused transpose is enabled, the b_data0 and b_data1 values are swapped.
  • Cycle 2:
    • In this cycle, the output counter is 0. The output sends the upper 8 bits of C00, the top-left output element.
    • To signal that the outputs are coming, the done signal is asserted, which is visible from the user.
    • The values given to the systolic array are a_data1 = weight[3], b_data1 = input[3].
  • Cycle 3:
    • In this cycle, the output counter is 1. The output sends the lower 8 bits of C00, the top-left output element.
    • Since we have finished sending all the last input values to the systolic array, intermediate feeds will have a value of 0, having no effect on the operation of MAC units.
  • Cycle 4:
    • In this cycle, the output counter is 2. The output sends the upper 8 bits of C01, the top-right output element.
  • Cycle 5:
    • In this cycle, the output counter is 3. The output sends the lower 8 bits of C01, the top-right output element.
  • Cycle 6:
    • In this cycle, the output counter is 4. The output sends the upper 8 bits of C10, the bottom-left output element.
  • Cycle 7:
    • In this cycle, the output counter is 5. The output sends the lower 8 bits of C10, the bottom-left output element.
  • Back to Cycle 0: (set by control unit)
    • In this cycle, the output counter is 6. The output sends the upper 8 bits of C11, the bottom-right output element.
    • The clear signal is asserted to make way for the computation of the next input to accumulate from 0.
    • Therefore, to preserve the lower 8 bits of C11 output in the next cycle, we assign the value to a "pipeline register", aptly named tail_hold as it holds the tail of the output.
    • The systolic array feeding pattern is the exact same as was shown above.
  • Cycle 1 Again:
    • In this cycle, the output counter is 7. The output sends the lower 8 bits of C11, the bottom-right output element.
    • The output count resets to 0 automatically.

For similar details on timing relationships, see Critical Timing Relationships above, in the Control Unit section.
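The feeding schedule for a single round can be checked with a small cycle-by-cycle simulation. This is a Python sketch under stated assumptions, not the project's RTL: the function name is hypothetical, A denotes the first matrix loaded (the one the feeder reads through weight[0..3]) and B the second (input[0..3]), each PE is assumed to register its operands and accumulate on every clock edge, and the clear signal, output multiplexing, and 16-bit wraparound are not modelled.

```python
def simulate_round(A, B, transpose=False):
    """Simulate one 2x2 round; A and B are flat row-major lists of 4 ints.

    Returns the accumulator values [c00, c01, c10, c11] after each of the
    four clock edges that follow feed cycles 0..3.
    """
    # Cycle 1 feeds b_data0 = B[2], b_data1 = B[1]; fused transpose swaps them.
    b_mid = (B[1], B[2]) if transpose else (B[2], B[1])
    # (a_data0, a_data1, b_data0, b_data1) for feed cycles 0..3
    feeds = [
        (A[0], 0, B[0], 0),
        (A[1], A[2], b_mid[0], b_mid[1]),
        (0, A[3], 0, B[3]),
        (0, 0, 0, 0),
    ]
    acc = [[0, 0], [0, 0]]
    a_reg = [[0, 0], [0, 0]]  # a_out register of each PE
    b_reg = [[0, 0], [0, 0]]  # b_out register of each PE
    history = []
    for a0, a1, b0, b1 in feeds:
        # Inputs flow left to right, weights flow top to bottom.
        a_in = [[a0, a_reg[0][0]], [a1, a_reg[1][0]]]
        b_in = [[b0, b1], [b_reg[0][0], b_reg[0][1]]]
        for i in range(2):
            for j in range(2):
                acc[i][j] += a_in[i][j] * b_in[i][j]
        a_reg, b_reg = a_in, b_in
        history.append([acc[0][0], acc[0][1], acc[1][0], acc[1][1]])
    return history


for step in simulate_round([1, 2, 3, 4], [5, 6, 7, 8]):
    print(step)
# [5, 0, 0, 0]
# [19, 6, 15, 0]
# [19, 22, 43, 18]
# [19, 22, 43, 50]
```

In this model c00 is final after the edge that ends cycle 1, c01 and c10 after the edge that ends cycle 2, and c11 after the edge that ends cycle 3, so they become visible in cycles 2, 3, and 4 respectively, which agrees with the timing list in the Control Unit section.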

How to test

Notation: the matrix element A_xy denotes a value in the xth row and yth column of the matrix A.

The module will assume an order of input of A matrix values and B matrix values, and outputs. That is, it is expected that inputs come in order of A00, A01, A10, A11, B00, B01, B10, B11, and the outputs will come in the order of C00, C01, C10, C11. This keeps the chip simple and avoids extra logic/user input.

Setup

  1. Power Supply: Connect the chip to a stable power supply as per the voltage specifications.
  2. Clock Signal: Provide a stable clock signal to the clk pin.
  3. Reset: Ensure the rst_n pin is properly connected to allow resetting the chip.

A Matrix Multiplication Round

  1. Initial Reset
    • Perform a reset by pulling the rst_n pin low to 0, waiting for a single clock cycle, and then pulling it back high to 1. This sets the initial state values.
  2. Initial Matrix Load
    • Load 8 matrix elements into the chip, one per cycle. For example, if your matrices are [[1, 2], [3, 4]], [[5, 6], [7, 8]], you would load in the row-major-first-matrix-first order of 1, 2, 3, 4, 5, 6, 7, 8. This occurs by setting the 8 ui_in pins to the 8-bit value of the set matrix element, and waiting one clock cycle before the next can be loaded.
  3. Collect Output & Send Next Inputs
    • Thanks to the aggressive pipelining implemented in the chip, once the matrices are loaded, you can already start collecting output!
    • Output elements will be 16 bits each, but since the output port is only 8 bits, one element is output in 2 cycles, with the upper half (bits 15 - 8) in the first cycle, and the lower half (bits 7 - 0) in the second cycle.
    • To collect outputs, wait for a single clock edge, and then read the uo_out pins for the 8-bit value. Repeat once more to get the full 16-bit value. Overall, the matrix output at uo_out comes in the order c_00, c_01, c_10, c_11, taking 8 cycles to output 4 elements.
    • For the above example, the output would be in the order of [19, 22, 43, 50], starting from the cycle right after you finish your last load, and ending 8 cycles afterwards.
    • It is also recommended that in those same 8 cycles, the next 2 input matrices are sent to the ui_in pins. That is 1 element per cycle for 2 serial, row-major 2x2 matrix inputs, for 8 cycles total.
  4. Repeat
  5. Input Options
    • Note that if no new matrices are input during the output cycles, i.e. the ui_in pins are held at 0, the pipeline is effectively flushed: once the output is complete, the chip is back at step 2.
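On the host side, the load order and the two-byte output format described in the steps above come down to a little byte packing. The helpers below are a hypothetical Python sketch (none of these names come from the project's test code); they only prepare and decode byte values, and leave driving ui_in and sampling uo_out to whatever test bench or microcontroller code is in use.

```python
def to_byte(value):
    """Encode a signed 8-bit element as the raw byte placed on ui_in."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} does not fit in a signed 8-bit element")
    return value & 0xFF


def load_sequence(A, B):
    """Return the 8 bytes to load: row-major order, first matrix first."""
    return [to_byte(v) for matrix in (A, B) for row in matrix for v in row]


def decode_outputs(out_bytes):
    """Combine 8 bytes read from uo_out (upper half first) into 4 signed values."""
    if len(out_bytes) != 8:
        raise ValueError("expected 8 output bytes")
    values = []
    for upper, lower in zip(out_bytes[0::2], out_bytes[1::2]):
        raw = (upper << 8) | lower
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values  # order: c_00, c_01, c_10, c_11


print(load_sequence([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [1, 2, 3, 4, 5, 6, 7, 8]
print(decode_outputs([0, 19, 0, 22, 0, 43, 0, 50]))
# [19, 22, 43, 50]
```

The output bytes in the second call are written out by hand from the expected result [19, 22, 43, 50]; they are not a capture from the chip.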

Below is a visual of an example matrix multiplication round through the systolic array. Note that while it behaves similarly to the chip, the chip's matrix inputs to the systolic array are the diagram's order inverted across the matrix diagonal.

[Figure: example matrix multiplication round through the systolic array]

Matrix Multiplication Options

The example shown above is a very simple and plain 2x2 matrix multiplication. However, this TPU chip offers additional options.

The first is the ability to compute the product $AB^T$, which is the first matrix multiplied by the transpose of the second. On a CPU, computing the transpose separately takes $O(n^2)$ time, where n is a rough measure of the matrix dimension. Here the transpose is fused into the multiplication itself, so it takes no extra time.

The second is the ability to run the Rectified Linear Unit (ReLU) activation function, commonly seen in neural networks for approximating non-linear patterns in data.
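A software reference for these two options is handy when checking chip output. The function below is a hypothetical Python sketch; it assumes ReLU is applied element-wise to the finished outputs and ignores 16-bit overflow.

```python
def matmul2x2(A, B, transpose=False, relu=False):
    """Reference 2x2 product: A @ B, or A @ B^T when transpose is set."""
    if transpose:
        B = [[B[0][0], B[1][0]], [B[0][1], B[1][1]]]
    C = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    if relu:
        # ReLU replaces every negative output with 0.
        C = [[max(0, v) for v in row] for row in C]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul2x2(A, B))                  # [[19, 22], [43, 50]]
print(matmul2x2(A, B, transpose=True))  # [[17, 23], [39, 53]]
print(matmul2x2([[1, -2], [3, 4]], B, relu=True))  # [[0, 0], [43, 50]]
```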

The third, which is provided as a software interface option, is the ability to multiply bigger matrices, of all compatible dimensions, in 2x2 blocks.
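The blocking idea can be sketched as follows. This is a hypothetical Python illustration, not the project's software interface: `tile_matmul`, `pad`, and `software_mul2x2` are made-up names, odd dimensions are handled by zero-padding, and the 2x2 product is computed in software where a real driver would send each block pair to the chip. Partial block products are summed on the host.

```python
def pad(M, rows, cols):
    """Zero-pad matrix M (a list of lists) to rows x cols."""
    padded = [list(r) + [0] * (cols - len(r)) for r in M]
    padded += [[0] * cols for _ in range(rows - len(M))]
    return padded


def software_mul2x2(a, b):
    """Stand-in for one 2x2 multiplication round on the chip."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]


def tile_matmul(A, B, mul2x2=software_mul2x2):
    """Multiply A (m x k) by B (k x n) using only 2x2 block products."""
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions do not match")
    # Round each dimension up to the next even number.
    mp, kp, np_ = (d + d % 2 for d in (m, k, n))
    Ap, Bp = pad(A, mp, kp), pad(B, kp, np_)
    C = [[0] * np_ for _ in range(mp)]
    for i in range(0, mp, 2):
        for j in range(0, np_, 2):
            for p in range(0, kp, 2):
                a_blk = [row[p:p + 2] for row in Ap[i:i + 2]]
                b_blk = [row[j:j + 2] for row in Bp[p:p + 2]]
                c_blk = mul2x2(a_blk, b_blk)
                for di in range(2):
                    for dj in range(2):
                        C[i + di][j + dj] += c_blk[di][dj]
    return [row[:n] for row in C[:m]]  # drop the padding


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 2]]
print(tile_matmul(A, B))  # [[7, 8], [16, 17], [25, 26]]
```

Because block results are accumulated across the inner dimension, an activation such as ReLU would have to be applied after the final sum rather than to each partial 2x2 product.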

Example Result

Below is a timing diagram showing signal progression for a streamed processing pattern of the TPU chip, generated by GTKWave:

[Figure: GTKWave timing diagram of the streamed processing pattern]

Input/output streaming is visible in mem_addr, which increments whenever load is enabled and therefore keeps counting 0-7, 0-7, and so on; the output count follows the same pattern, slightly offset in time.

One can also observe the pattern in which elements are fed into the systolic array, as seen by a_data0, a_data1, b_data0, b_data1 signals, and the output "waterfall" flow of output appearances seen inside c00, c01, c10, c11.

Scaling it up

Earlier, it was mentioned that you could scale the multiplication up to any dimension. What else does this mean? AI inference! We are able to run forward inference of a Quantization-Aware-Trained (QAT) machine learning model using the chip's logic.

The model is trained to recognize black-and-white images of single handwritten digits from the MNIST dataset.

In the demonstration, which is kept simple, I run QAT on the local PC, and then run forward inference with this model on the chip. It successfully recognized 2 out of 3 images in a test batch, which is well above the 10% expected from random guessing among 10 digits.

External hardware

An external microcontroller will send signals over the chip interface, including the clock signal, which will allow it to coordinate I/O on clock edges.

Acknowledgements

  • William Zhang: Processing Elements, Systolic Array, Module Compilation & Integration, Pipelining Optimization
  • Ethan Leung: Matrix Unit Feeder
  • Guhan Iyer: Unified Memory
  • Yash Karthik: Control Unit
  • ECE 298A Course Staff: Prof. John Long, Prof. Vincent Gaudet, Refik Yalcin

An earlier iteration of this project is located at this repository, in which the original plan was to submit to the IHP25B shuttle.

IO

# | Input | Output | Bidirectional
0 | IN0   | OUT0   | LOAD_EN (input)
1 | IN1   | OUT1   | TRANSPOSE (input)
2 | IN2   | OUT2   | ACTIVATION (input)
3 | IN3   | OUT3   | Unused
4 | IN4   | OUT4   | Unused
5 | IN5   | OUT5   | Unused
6 | IN6   | OUT6   | Unused
7 | IN7   | OUT7   | DONE (output)
