493 : Tiny NPU: 4-Way Parallel INT8 Inference Engine

Author: Malik

Description: 4-way parallel INT8 neural network inference engine with SproutHDL arithmetic units and ReLU activation

Clock: 50000000 Hz

How it works

A minimal neural processing unit with 4 parallel multiply-accumulate datapaths that computes a single fully-connected layer: y = ReLU(W · x + b).

The NPU integrates custom arithmetic units generated by SproutHDL through ML-guided design-space exploration:

4× Han-Carlson multipliers (8-bit unsigned, structurally optimized)
4× Sparse Kogge-Stone adders (24-bit two's complement, area-optimized)
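For readers unfamiliar with parallel-prefix adders, the following is a minimal behavioural sketch of a plain (non-sparse) Kogge-Stone adder at the 24-bit width listed above. It is not the SproutHDL-generated netlist: the sparse variant used on the chip computes only a subset of the carries in the prefix tree, and the function name kogge_stone_add is hypothetical.

```python
def kogge_stone_add(a: int, b: int, width: int = 24) -> int:
    """Add two width-bit values with a Kogge-Stone parallel-prefix carry tree.

    Illustrative model only: a full (non-sparse) prefix tree with carry-in 0.
    The result wraps modulo 2**width, as a fixed-width adder would.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    generate = a & b      # bit i produces a carry on its own
    propagate = a ^ b     # bit i passes an incoming carry along
    half_sum = propagate  # sum bits before carries are applied

    distance = 1
    while distance < width:
        # Combine each position with the group 'distance' bits below it.
        generate |= propagate & (generate << distance)
        propagate &= propagate << distance
        generate &= mask
        propagate &= mask
        distance <<= 1

    carries = (generate << 1) & mask  # carry into bit i comes from group i-1
    return (half_sum ^ carries) & mask
```

Each loop iteration corresponds to one level of the prefix tree, so a 24-bit addition needs five levels (distances 1, 2, 4, 8, 16).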

Since the multipliers are unsigned but the NPU handles signed INT8 values, a lightweight sign-management wrapper computes absolute values, multiplies, and conditionally negates the result. A pipeline register between the multiply and accumulate stages ensures clean timing closure.
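The sign-management scheme described above can be sketched in Python as follows. This is a behavioural illustration, not the chip's RTL; the function name signed_mul_via_unsigned is hypothetical, and the pipeline register is not modelled.

```python
def signed_mul_via_unsigned(a: int, b: int) -> int:
    """Multiply two signed INT8 values using only an unsigned 8x8 multiply."""
    if not (-128 <= a <= 127 and -128 <= b <= 127):
        raise ValueError("operands must fit in signed INT8")

    # The product is negative when exactly one operand is negative.
    negate = (a < 0) != (b < 0)

    # Absolute values range over 0..128, which still fits in 8 unsigned bits.
    mag_a = abs(a) & 0xFF
    mag_b = abs(b) & 0xFF

    # Stand-in for the unsigned multiplier: 8 x 8 bits -> at most 16 bits.
    product = mag_a * mag_b

    # Conditional negation restores the sign.
    return -product if negate else product
```

Note that -128 has no positive INT8 counterpart, but its magnitude 128 is representable as an unsigned 8-bit value, so the corner case -128 × -128 = 16384 is still handled.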

Specifications:

4 parallel datapaths (one per output neuron)
Weight storage: 32 × INT8 register file (4 outputs × 8 inputs)
Bias storage: 4 × INT16
Input buffer: 8 × INT8 activations
24-bit accumulator precision per output
Configurable: 1–8 inputs, 1–4 outputs
ReLU activation with INT8 output saturation
Pipelined inference: N_IN + 3 cycles (8 inputs → 11 cycles @ 50 MHz = 220 ns)
Weights persist across inferences for batch processing
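A small software reference model can help when checking outputs against the specifications above. The sketch below makes several assumptions that the source does not state: the accumulator is seeded with the bias, "INT8 output saturation" after ReLU means clamping to the range 0..127, and the 24-bit accumulator wraps in two's complement. All function names are hypothetical.

```python
def wrap24(value: int) -> int:
    """Wrap an integer into 24-bit two's complement (assumed accumulator width)."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def relu_sat_int8(acc: int) -> int:
    """ReLU followed by saturation; the 0..127 output range is an assumption."""
    return min(max(acc, 0), 127)


def npu_layer(weights, biases, acts):
    """Compute y = ReLU(W . x + b) for up to 4 outputs and 8 inputs.

    weights: list of rows (one per output neuron), each a list of INT8 values
    biases:  one INT16 value per output neuron
    acts:    INT8 input activations
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias  # assumption: accumulation starts from the bias
        for w, x in zip(row, acts):
            acc = wrap24(acc + w * x)
        outputs.append(relu_sat_int8(acc))
    return outputs


def inference_cycles(n_in: int) -> int:
    """Pipelined latency from the specification: N_IN + 3 cycles."""
    return n_in + 3


def latency_ns(n_in: int, clock_hz: int = 50_000_000) -> float:
    """Latency in nanoseconds at the given clock (50 MHz by default)."""
    return inference_cycles(n_in) * 1e9 / clock_hz
```

For example, npu_layer([[1, 2], [-1, -1]], [10, 0], [3, 4]) accumulates 21 and -7, so the model returns [21, 0]. With 8 inputs the worst-case accumulator magnitude is 8 × 128 × 128 plus a 16-bit bias, well inside 24 bits, so the wrap never triggers in practice.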

How to test

Reset: rst_n low → high
Config: cmd=0x1, data[5:4] = n_out-1, data[2:0] = n_in-1 (e.g. 8 inputs, 4 outputs → data=0x37)
Load weights row-major: cmd=0x2, data=weight (auto-increments)
Load biases: cmd=0x3, data=bias (auto-increments)
Load activations: cmd=0x4, data=act (auto-increments)
Run: cmd=0x5. Wait for busy→0.
Reset pointers: cmd=0x7
Read outputs: cmd=0x6 (auto-increments)

Design context

The arithmetic units were produced by SproutHDL (github.com/huawei-csl/sprout-hdl) as part of a semester project on ML-guided design-space exploration of AI hardware architectures. The Han-Carlson multiplier and sparse Kogge-Stone adder represent Pareto-optimal points on the area-delay tradeoff frontier identified through automated exploration.

External hardware

None required.

#	Input	Output	Bidirectional
0	data_in[0]	data_out[0]	cmd[0]
1	data_in[1]	data_out[1]	cmd[1]
2	data_in[2]	data_out[2]	cmd[2]
3	data_in[3]	data_out[3]	cmd[3]
4	data_in[4]	data_out[4]
5	data_in[5]	data_out[5]
6	data_in[6]	data_out[6]
7	data_in[7]	data_out[7]
