Combinational multiply / divider unit (8bit+8bit input)
This is an updated version of the original project that was submitted and manufactured in TT04 (https://github.com/dlmiles/tt04-muldiv4). The previous project was hand crafted in Logisim-Evolution then exported as verilog and integrated into a TT04 project.
This version is the same design, extended to 8-bit wide inputs, but instead of hand crafting the logic gates in a GUI we convert functional blocks into SpinalHDL language constructs. Part of the purpose of this design is to understand the area and timing changes introduced by adding more bits, then to explore alternative topologies.
The goal of the next iteration of this design maybe to introduce a FMA (Fused Multiply Add/Accumulate) function and ALU function to explore if there is some useful composition of these functions (that might be useful in an 8bit CPU/MCU design, or scale to something bigger). The next iteration on from this could explore how to draw the transistors directly (instead of using standard cell library) for such an arrangement, this may result in non-rectangular cells that interlock to improve both area density and timing performance. Or it might go up in smoke... who knows.
Due to the limited total IOs available at the external TT interface it is necessary to clock the project and setup UI_IN[0] to load each of the 2 8-bit input registers.
The input side uses latches to capture, which means during the appropiate phase CLK (high) and ADDR state, it alternatively opens/closes, the data is becomes captured into the latches at the CLK NEGEDGE. During the whole time it is open and closed it is providing the data into the appropiate input side of both MUL and DIV units (which are seperate logic modules).
The result becomes immediately available (after propagation and ripple settling time) at the outputs. While the latch it open, maybe artificially by extending duty-cycle of CLK, you should also be able to conduct experiments on modifying input and observing output (when in immediate result mode)
The result output is also multiplexed and has an immediate and registered mode. The immediate mode provides a direct visibility of the MUL/DIV combintational output and should allos timing between input and outputs to be observed. (you need to account for address multiplex of high-low 8bit sides of result). The registered mode capture the result in full at the time of the last ADDR and a CLK posedge. This allows you to change the values for the input side during the next few cycles, while the module ensures to sustain the result value of the last computation at the output. With an appropiate pipeline interleave request and result information to achieve higher throughput.
FIXME
FIXME please check out the original github for any enhanced documentation for this project, potentially improved information nearer PCB+IC delivery (to customer) schedule but also post-production post-physically testing results and information. I hope to produce some kind graphs showing the timing capture and reliability to show and demonstrate the cascade effect. This assumes I have the design correct to allow this to happen, but there are some tricks (like extending CLK on-duty cycle when latches are open) enough to see result capture output.
FIXME provide wavedrom diagram (MULU, MULS, DIVU, DIVS)
FIXME explain IMMediate mode and REGistered mode (to pipeline)
FIXME provide blockdiagram of functional units
// D
// MUX
// X Y registers (loaded from multiplexed D)
// OP -> res flags
// P P registers
// DEMUX
// R
FIXME explain architective difference to previous example and considerations why to change.
FIXME explain addressing mode to allow much wider units and potentially uneven input sizes.
Multiplier (signed/unsigned) Method uses Ripple Carry Array as 'high speed multiplier' Setup operation mode bits MULDIV=0 and OPSIGNED(unsigned=0/signed=1) Setup A (multiplier 8-bit) * B (multiplicand 8-bit) Expect result P (product 16-bit)
Divider (signed/unsigned) Method uses Full Adder with Mux as 'combinational restoring array divider algorithm'. Setup operation mode bits MULDIV=1 and OPSIGNED(unsigned=0/signed=1) Setup Dend (dividend 8-bit) / Dsor (divisor 8-bit) Expect result Q (quotient 8-bit) with R (remainder 8-bit)
Divider has error bit indicators that take precedence over any result. If any error bit is set then the output Q and R should be disregarded. When in multiplier mode error bits are muted to 0. No input values can cause an overflow error so the bit is always reset.
Please check back with the project github main page and the published docs/ directory. There is expected to be some instructions provided around the time the TT05 chips a received (Q4 2024).
At the time of writing receiving a physical chip (from a previous TT edition) back has not occured, so there is no experience on the best way to test this project, so I defer the task of writing this section to a later time.
There should be sufficient instructions here start you own journey.
It is expect the RP2040 and a Python REPL should be sufficient test this project.
uio_in[3] might moved to bit4 and DIV0/OVER combined into bit5 This would allow the address the contigious area below. However during a test build of a MULDIV16 version it easily exceeds 1x1, as this stage looking towards making builds with permutations of design/topology and method to generate GDS. So 1x1 is good to achieve this.
The uio_in[3] feature wants to use registered mode to lock result when last address is clocked in this way we can pipeline result and demonstration of what pipelining can do to increase thoughput.
The TB is limited to the 4bit version. Ran out of time to validate registered output and pipeline.
Encapsulate the SpinalHDL Scala netlist generation, and write a yosys JVM module harness (a yosys C++ module that is a JVM thread/process runner, with communication interface, data/ffi API/lifecycle). Then write a yosys plugin that allows it to directly include, use and call for generated data based on parametric details.
Consider emitting a custom cell/macro/GDS_object that yosys can call for, then emit verilog like a regular standard cell module.
Consider modifying OpenROAD/OpenLane to incorporate generated macros directly into other detailed routing environment then have the existing detailed routing work around it as-is.
Fixup the original logisim schematic labels.
The input re-ordering (which made the SpinalHDL algo easier)
Relabel the P6_EXTND_EN to P7_EXTND_EN the original product index label was a bad choice in retrospect.
Provide the SpinalHDL directory to the project with the sbt project and netlist generation code.
Fill out SpinalHDL unit testing testing.
Test support for SUPPORT_SIGNED=false (try to completely remove nets from output instead of assigning constant False and letting synthesis optimize away)
Implement support for seperate SUPPORT_SIGNED for each input with 3 modes of operation ALWAYS/NEVER/BOTH(like now using control input bit)
Implement and test support for odd-sized inputs, so the width of X and Y or DEND and DSOR can be different sizes.
When input width can be unequal, test out the EOVERFLOW in the divider is wired to the correct port and works in this scenarios.
Provide unit testing for common multipler sizes, obvious byte boundaries but also the sizes common in FPGA DSP primitives.
# | Input | Output | Bidirectional |
---|---|---|---|
0 | Data0 see docs | Result0 see docs | Addr bit0 HI=1/lo=0 mux of Data and Result (input only) |
1 | Data1 see docs | Result1 see docs | |
2 | Data2 see docs | Result2 see docs | |
3 | Data3 see docs | Result3 see docs | Result mux registered=1/immediate=0 (input only) |
4 | Data4 see docs | Result4 see docs | DIV error overflow (output only) |
5 | Data5 see docs | Result5 see docs | DIV error divide-by-zero (output only) |
6 | Data6 see docs | Result6 see docs | OPSIGNED mode (input only) |
7 | Data7 see docs | Result7 see docs | MULDIV mode (input only) |