
This design implements a 32-bit multiply-accumulate (MAC) unit for tiny ML accelerators. A 7-bit × 8-bit multiplier produces a 15-bit product every clock cycle. A controller FSM shifts partial sums and aligns bias bytes across four phases in training mode, or a single phase in inference mode.
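
The 15-bit product width follows from the operand widths. The check below is a minimal sketch that assumes both operands are unsigned, which the text does not state explicitly; the constant names are illustrative only.

```python
# Assumption: both operands are unsigned; signedness is not stated in the text.
ACT_BITS = 7      # activation width, ui_in[6:0]
WEIGHT_BITS = 8   # weight width, ui_in[7:0]

max_activation = (1 << ACT_BITS) - 1     # largest 7-bit value
max_weight = (1 << WEIGHT_BITS) - 1      # largest 8-bit value
max_product = max_activation * max_weight

# Number of bits needed to hold the largest possible product.
product_bits = max_product.bit_length()
print(max_product, product_bits)  # 32385 15
```
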

| Signal | Role |
|---|---|
| ui_in[6:0] | Activation (latched while rst_n is low) |
| ui_in[7] | Mode: 0 = inference, 1 = training |
| ui_in[7:0] | Weight byte (operand B) each active cycle |
| uio_in[7:0] | Bias byte (training); uio drives result[23:16] in inference |
| uo_out[7:0] | result[31:24] |

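
Because ui_in carries different fields at different times, a host-side helper can make the bit slicing explicit. This is a hypothetical sketch: the function name decode_ui_in is not part of the project, and which field is meaningful on a given cycle depends on the reset and timing behaviour described in the table.

```python
def decode_ui_in(ui_in: int) -> dict:
    """Split an 8-bit ui_in value into the fields listed in the signal table."""
    ui_in &= 0xFF  # keep only the 8 pin bits
    return {
        "activation": ui_in & 0x7F,   # ui_in[6:0], latched while rst_n is low
        "mode": (ui_in >> 7) & 1,     # ui_in[7]: 0 = inference, 1 = training
        "weight": ui_in,              # ui_in[7:0], weight byte on active cycles
    }

fields = decode_ui_in(0xA5)
print(fields)  # {'activation': 37, 'mode': 1, 'weight': 165}
```
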
out_th is set when any bit of product[14:11] is high (a large partial product).
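
The out_th condition can be modelled in a few lines. This sketch assumes the product is treated as an unsigned 15-bit value, in which case the test is equivalent to product >= 2048; the Python function is only a model of the rule above, not the RTL.

```python
def out_th(product: int) -> bool:
    """Return True when any of product[14:11] is set."""
    return ((product >> 11) & 0xF) != 0

# For an unsigned 15-bit product this is the same as a threshold at 2**11.
print(out_th(2047), out_th(2048))  # False True
```
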
Normal path (full cycle):
S0 capture product → S1 → S2 → S3 → S4 finalize → S1
Early exit when out_th is set:
| Current state | Next path |
|---|---|
| S1 | S5 → S6 → S7 → S1 |
| S2 | S6 → S7 → S1 |
| S3 | S7 → S1 |
States S5–S7 use the same bias alignment as S2–S4 but skip the remaining normal phases.
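
The normal and early-exit paths above can be sketched as a next-state function. Two points are assumptions rather than facts from the text: out_th is consulted only in S1 to S3, and S5 to S7 advance unconditionally. The names next_state and path are hypothetical.

```python
# Transition tables taken from the paths listed above.
NORMAL_NEXT = {"S0": "S1", "S1": "S2", "S2": "S3", "S3": "S4", "S4": "S1"}
EARLY_ENTRY = {"S1": "S5", "S2": "S6", "S3": "S7"}   # taken when out_th is set
EARLY_NEXT = {"S5": "S6", "S6": "S7", "S7": "S1"}

def next_state(state: str, out_th: bool) -> str:
    """Next FSM state (assumes out_th only matters in S1, S2 and S3)."""
    if state in EARLY_NEXT:
        return EARLY_NEXT[state]
    if out_th and state in EARLY_ENTRY:
        return EARLY_ENTRY[state]
    return NORMAL_NEXT[state]

def path(start: str, out_th: bool) -> list:
    """States visited from `start` until the FSM is back in S1.

    out_th is applied to the first transition only (illustrative choice).
    """
    states = [start]
    state = next_state(start, out_th)
    states.append(state)
    while state != "S1":
        state = next_state(state, False)
        states.append(state)
    return states

print(path("S1", False))  # ['S1', 'S2', 'S3', 'S4', 'S1']
print(path("S2", True))   # ['S2', 'S6', 'S7', 'S1']
```
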
In inference mode, S0 captures the first product and the FSM then holds in S1. Each cycle accumulates {0, product, bias, 0} into sum, shifts result left by 8, and drives uio_out.
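
The accumulate and shift steps described above can be sketched in Python. The field widths in the concatenation are an assumption: the text gives only {0, product, bias, 0}, and this sketch reads it as one zero bit, the 15-bit product, the 8-bit bias, and eight zero bits, which adds up to 32 bits. The helper names are hypothetical, and the exact order in which sum and result are updated in the RTL is not modelled.

```python
MASK32 = 0xFFFFFFFF

def pack_term(product: int, bias: int) -> int:
    """Assumed layout: bit 31 = 0, bits 30:16 = product, bits 15:8 = bias, bits 7:0 = 0."""
    return ((product & 0x7FFF) << 16) | ((bias & 0xFF) << 8)

def accumulate(total: int, product: int, bias: int) -> int:
    """Add one packed term to the running sum with 32-bit wraparound."""
    return (total + pack_term(product, bias)) & MASK32

def shift_result(result: int) -> int:
    """Shift the 32-bit result left by one byte, dropping the top byte."""
    return (result << 8) & MASK32

def output_bytes(result: int) -> tuple:
    """Return (uo_out, uio_out) = (result[31:24], result[23:16])."""
    return (result >> 24) & 0xFF, (result >> 16) & 0xFF

print(hex(pack_term(1, 2)))                           # 0x10200
print([hex(b) for b in output_bytes(0x12345678)])     # ['0x12', '0x34']
```
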
From test/:
make clean && make
Cocotb tests in test/test.py compare the RTL against test/mac_reference.py for:
No external hardware is required beyond the Tiny Tapeout carrier. A host MCU can stream weights and bias bytes on the GPIO pins.
| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | ui_in[0] | uo_out[0] | uio_in[0] |
| 1 | ui_in[1] | uo_out[1] | uio_in[1] |
| 2 | ui_in[2] | uo_out[2] | uio_in[2] |
| 3 | ui_in[3] | uo_out[3] | uio_in[3] |
| 4 | ui_in[4] | uo_out[4] | uio_in[4] |
| 5 | ui_in[5] | uo_out[5] | uio_in[5] |
| 6 | ui_in[6] | uo_out[6] | uio_in[6] |
| 7 | mode | uo_out[7] | uio_in[7] |