
A minimal neural processing unit with 4 parallel multiply-accumulate datapaths that computes a single fully-connected layer: y = ReLU(W · x + b).
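As a cross-check for simulation results, the layer equation above can be written as a small bit-exact reference model. This is a sketch under assumptions, not the NPU's actual behaviour: the function name `fc_relu` is hypothetical, the accumulator is treated as unbounded, and no output saturation or scaling is modelled because the text does not specify either.

```python
def fc_relu(W, x, b):
    """Reference model of y = ReLU(W . x + b) for signed INT8 inputs.

    W is a list of rows (one row per output), x is the input vector,
    b holds one bias per output. Accumulator width and output
    saturation are NOT modelled (assumption: unbounded Python ints).
    """
    for row in W:
        assert len(row) == len(x), "each weight row must match len(x)"
        assert all(-128 <= w <= 127 for w in row), "weights must be INT8"
    assert all(-128 <= v <= 127 for v in x), "inputs must be INT8"
    assert len(b) == len(W), "one bias per output"

    y = []
    for row, bias in zip(W, b):
        acc = bias
        for w, v in zip(row, x):
            acc += w * v          # one multiply-accumulate step
        y.append(max(acc, 0))     # ReLU clamps negative sums to zero
    return y


if __name__ == "__main__":
    print(fc_relu([[1, -2], [3, 4]], [5, 6], [10, -30]))
```

The model loops over outputs sequentially; how the hardware distributes this work across its four MAC datapaths is not captured here.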
The NPU integrates custom arithmetic units generated by SproutHDL through ML-guided design-space exploration: a Han-Carlson multiplier and a sparse Kogge-Stone adder.
Since the multipliers are unsigned but the NPU handles signed INT8 values, a lightweight sign-management wrapper computes the absolute values of both operands, multiplies them, and negates the product when the operand signs differ. A pipeline register between the multiply and accumulate stages shortens the critical path and eases timing closure.
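The sign-management scheme described above can be illustrated with a short behavioural model. It is a sketch only: `unsigned_mul8` stands in for the generated unsigned multiplier, both function names are hypothetical, and the pipeline register is not modelled.

```python
def unsigned_mul8(a, b):
    """Stand-in for the unsigned 8x8 multiplier (assumed 8-bit operands)."""
    assert 0 <= a <= 255 and 0 <= b <= 255
    return a * b


def signed_mul8(a, b):
    """Signed INT8 multiply built on top of an unsigned multiplier."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    negate = (a < 0) != (b < 0)            # result is negative if signs differ
    mag = unsigned_mul8(abs(a), abs(b))    # |−128| = 128 still fits in 8 bits
    return -mag if negate else mag


if __name__ == "__main__":
    # Exhaustive check against native signed multiplication.
    ok = all(signed_mul8(a, b) == a * b
             for a in range(-128, 128) for b in range(-128, 128))
    print("all INT8 pairs match:", ok)
```

Note the corner case: the magnitude of -128 is 128, which does not fit in a signed 8-bit value but does fit in the unsigned 8-bit operand assumed here.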
Specifications:
`rst_n` low → high

`{n_out-1}[5:4] | {n_in-1}[2:0]`

The arithmetic units were produced by SproutHDL (github.com/huawei-csl/sprout-hdl) as part of a semester project on ML-guided design-space exploration of AI hardware architectures. The Han-Carlson multiplier and sparse Kogge-Stone adder represent Pareto-optimal points on the area-delay tradeoff frontier identified through automated exploration.
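The bit-field expression `{n_out-1}[5:4] | {n_in-1}[2:0]` given earlier can be read as a packed layer-size word. The sketch below assumes that reading: `n_out - 1` occupies bits 5:4 and `n_in - 1` occupies bits 2:0, with all other bits zero. The function names, the value ranges (1 to 4 outputs, 1 to 8 inputs), and the idea that this is a single configuration byte are assumptions, not confirmed by the text.

```python
def pack_layer_size(n_in, n_out):
    """Pack layer dimensions, assuming bits [5:4] = n_out-1, bits [2:0] = n_in-1."""
    assert 1 <= n_in <= 8, "n_in - 1 must fit in 3 bits (assumed range)"
    assert 1 <= n_out <= 4, "n_out - 1 must fit in 2 bits (assumed range)"
    return ((n_out - 1) << 4) | (n_in - 1)


def unpack_layer_size(word):
    """Inverse of pack_layer_size under the same assumed layout."""
    n_out = ((word >> 4) & 0b11) + 1
    n_in = (word & 0b111) + 1
    return n_in, n_out


if __name__ == "__main__":
    word = pack_layer_size(n_in=8, n_out=4)
    print(f"0x{word:02X}", unpack_layer_size(word))
```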
None required.
| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | data_in[0] | data_out[0] | cmd[0] |
| 1 | data_in[1] | data_out[1] | cmd[1] |
| 2 | data_in[2] | data_out[2] | cmd[2] |
| 3 | data_in[3] | data_out[3] | cmd[3] |
| 4 | data_in[4] | data_out[4] | |
| 5 | data_in[5] | data_out[5] | |
| 6 | data_in[6] | data_out[6] | |
| 7 | data_in[7] | data_out[7] | |