This project is a neural network accelerator designed for use with convolutional neural networks. The Verilog is generated from SystemVerilog source, which lives in a separate repository: https://github.com/GregAC/tiny-nn. That repository also contains the full DV environment, model, documentation and related utilities and software.
Internally it contains a number of 16-bit floating point add and multiply units (using something approximating the BF16 floating point encoding) that can be configured to work in different ways for different operations. Operations available are
The interface is a fixed 16 bits in and 8 bits out, synchronous to the clock. Each operation is started by a special operand code sent on the 16 input bits. Once the operation has started, the 16 input bits provide the numbers it uses.
Each 16-bit output number is split over 2 clock cycles on the 8-bit output, with the lower byte output first. The user needs to know when the output is relevant: on some cycles the output should be captured and on others it should be ignored.
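As a minimal sketch of the host side of this output format, the Python below reassembles 16-bit words from bytes that have already been captured on the relevant cycles. The function name `join_output_bytes` and the sample byte list are illustrative assumptions, not part of the project's software; deciding which cycles are relevant is left to the caller.

```python
def join_output_bytes(captured):
    """Combine captured 8-bit outputs into 16-bit words, lower byte first.

    `captured` is assumed to hold only the bytes from relevant cycles,
    in the order they appeared on the output.
    """
    if len(captured) % 2 != 0:
        raise ValueError("expected an even number of captured bytes")
    words = []
    for i in range(0, len(captured), 2):
        low, high = captured[i], captured[i + 1]
        words.append((high << 8) | low)
    return words


# Sample data (hypothetical capture): two words, 0x0000 then 0x4060.
words = join_output_bytes([0x00, 0x00, 0x60, 0x40])
print([f"{w:04X}" for w in words])  # ['0000', '4060']
```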
https://github.com/GregAC/tiny-nn should contain full documentation with the details and software to use the accelerator (both a work in progress at tapeout time!).
There are 3 test modes for testing basic input and output.
Place 16'hFFFF on the input {ui_in, uio_in} and hold it. On the output you will observe a repeating pattern:
This is 'T-NN' in ASCII
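For reference, the byte values of the ASCII string 'T-NN' can be computed as below. This is only a sketch of the encoding: it assumes the bytes appear on the output in string order, and it says nothing about where in the sequence the repetition starts.

```python
# ASCII byte values for the string 'T-NN'.
pattern = "T-NN".encode("ascii")
print([f"0x{b:02X}" for b in pattern])  # ['0x54', '0x2D', '0x4E', '0x4E']
```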
Place 16'hF000 on the input {ui_in, uio_in} and hold it. On the output you will observe a repeating pattern:
Place 16'hF1XX on the input {ui_in, uio_in}, where XX is any 8-bit number, and on the output you will observe a countdown from that number.
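The notation {ui_in, uio_in} is a Verilog concatenation, so ui_in carries the upper byte of the 16-bit input word and uio_in the lower byte. A small sketch of how a driver might split the test-mode words across the two ports follows; the helper names `split_input_word` and `countdown_word` are hypothetical and not taken from the project's software.

```python
def split_input_word(word):
    """Split a 16-bit input word into (ui_in, uio_in) per {ui_in, uio_in}."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("input word must fit in 16 bits")
    return (word >> 8) & 0xFF, word & 0xFF


def countdown_word(start):
    """Build the 16'hF1XX word, with XX set to the 8-bit start value."""
    if not 0 <= start <= 0xFF:
        raise ValueError("start value must fit in 8 bits")
    return 0xF100 | start


# Drive values for the three test modes (start value 0x2A is arbitrary).
for word in (0xFFFF, 0xF000, countdown_word(0x2A)):
    ui_in, uio_in = split_input_word(word)
    print(f"word={word:04X} ui_in={ui_in:02X} uio_in={uio_in:02X}")
```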
The simplest operation is the accumulate one. We'll configure it to add two numbers at a time with a -3.5 bias and RELU, then compute 1.0 + 2.0 and 3.0 + 4.0. Put the following on the input over successive clocks:
On the output you should observe:
The 16'hX outputs could be anything and should be ignored. The first number output is 0000, representing 0.0, because RELU(1.0 + 2.0 - 3.5) = RELU(-0.5) = 0.0. The second number output is 4060, representing 3.5, because RELU(3.0 + 4.0 - 3.5) = 3.5.
No specific external hardware is required, but something external must drive the desired input sequences. This can be handled by the RP2040 on the demo board.
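The expected values in this example can be checked on a host with a short software model. The sketch below assumes standard BF16, meaning the upper 16 bits of an IEEE 754 single-precision float, with truncation. The accelerator only approximates BF16, so rounding and special values may differ on hardware, but the numbers used here are exactly representable. All function names are illustrative.

```python
import struct


def float_to_bf16(value):
    """Return the 16-bit BF16 pattern for a float (truncating, no rounding)."""
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    return bits32 >> 16


def bf16_to_float(bits16):
    """Expand a 16-bit BF16 pattern back into a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))[0]


def relu(x):
    """Rectified linear unit: negative inputs become 0.0."""
    return x if x > 0.0 else 0.0


bias = -3.5
for a, b in [(1.0, 2.0), (3.0, 4.0)]:
    result = relu(a + b + bias)
    print(f"RELU({a} + {b} - 3.5) = {result} -> {float_to_bf16(result):04X}")
# RELU(1.0 + 2.0 - 3.5) = 0.0 -> 0000
# RELU(3.0 + 4.0 - 3.5) = 3.5 -> 4060
```

The same helper gives the encodings of the operands under this assumption, for example `float_to_bf16(1.0)` is `0x3F80` and `float_to_bf16(-3.5)` is `0xC060`.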
# | Input | Output | Bidirectional |
---|---|---|---|
0 | ui[0] | uo[0] | uio[0] |
1 | ui[1] | uo[1] | uio[1] |
2 | ui[2] | uo[2] | uio[2] |
3 | ui[3] | uo[3] | uio[3] |
4 | ui[4] | uo[4] | uio[4] |
5 | ui[5] | uo[5] | uio[5] |
6 | ui[6] | uo[6] | uio[6] |
7 | ui[7] | uo[7] | uio[7] |