
This is a hardware implementation of the ChaCha20 stream cipher as specified in RFC 8439: a 256-bit key, a 96-bit nonce, and a 32-bit block counter. ChaCha20 produces a keystream that is XORed with the data, so encryption and decryption are the same operation.
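The design is a command-driven peripheral. A host loads the key, nonce, and counter, then issues one of two operations:

- GEN emits N × 64 bytes of raw keystream.
- CRYPT XORs each data byte sent by the host with the keystream and returns the result.
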
It has three blocks:

- chacha20_core computes the ChaCha20 block function. It holds the 16-word (512-bit) state, runs the 20 rounds through four parallel quarter-round units (one Add-Rotate-XOR step per clock), then adds the original state back in. The controller reads the result one 32-bit word at a time, so no wide keystream bus is materialised.
- chacha20_controller is the command FSM. It decodes the command byte, collects the payload, drives the core, and streams keystream/ciphertext bytes back out. It speaks a transport-agnostic byte interface (one byte in with a valid strobe, one byte out with a busy/send handshake).
- A front-end, selected by the MODE pin (uio[3]):
  - MODE = 0: a UART (8N1) at baud = clock / 200.
  - MODE = 1: a synchronous parallel byte bus (one byte per clock).

Both front-ends present the identical byte interface to the controller, so the core and protocol are the same regardless of which one is used.

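As a software point of comparison for chacha20_core, the block function can be sketched in a few lines of Python. This is a minimal model written from RFC 8439, not the repository's test/chacha20_ref.py. The names `rotl32`, `quarter_round`, and `chacha20_block` are illustrative, and the quarter rounds run sequentially here, whereas the hardware evaluates four in parallel.

```python
import struct

MASK32 = 0xFFFFFFFF
# "expand 32-byte k" as four little-endian words (RFC 8439, section 2.3).
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)


def rotl32(value, shift):
    """Rotate a 32-bit word left by `shift` bits."""
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def quarter_round(state, a, b, c, d):
    """Apply one quarter round in place: four Add-Rotate-XOR steps."""
    steps = ((a, b, d, 16), (c, d, b, 12), (a, b, d, 8), (c, d, b, 7))
    for x, y, z, shift in steps:
        state[x] = (state[x] + state[y]) & MASK32       # add
        state[z] = rotl32(state[z] ^ state[x], shift)   # xor, then rotate


def chacha20_block(key, counter, nonce):
    """Return the 64-byte keystream block for a key, counter, and nonce."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("key must be 32 bytes and nonce 12 bytes")
    initial = (
        list(CONSTANTS)
        + list(struct.unpack("<8I", key))
        + [counter & MASK32]
        + list(struct.unpack("<3I", nonce))
    )
    state = list(initial)
    for _ in range(10):  # 10 double rounds = 20 rounds
        # Column rounds.
        quarter_round(state, 0, 4, 8, 12)
        quarter_round(state, 1, 5, 9, 13)
        quarter_round(state, 2, 6, 10, 14)
        quarter_round(state, 3, 7, 11, 15)
        # Diagonal rounds.
        quarter_round(state, 0, 5, 10, 15)
        quarter_round(state, 1, 6, 11, 12)
        quarter_round(state, 2, 7, 8, 13)
        quarter_round(state, 3, 4, 9, 14)
    # Add the original state back in, then serialise little-endian.
    words = [(s + i) & MASK32 for s, i in zip(state, initial)]
    return struct.pack("<16I", *words)
```

Calling `chacha20_block(key, counter, nonce)` with an RFC 8439 test vector is a quick way to sanity-check a capture from the chip before involving the full test suite.
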
At the 35 MHz target clock, the core computes a 64-byte block in about 84 cycles (~0.76 bytes/cycle), a theoretical ceiling of ~27 MB/s. Deliverable throughput is set by the host interface:

- Parallel mode: GEN is fastest at HOLD_SEL = 0 and runs at ~6.5 MB/s at the default HOLD_SEL = 1. The workable HOLD_SEL, and hence the real rate, depends on output-pad settling and how fast the host reads. CRYPT is lower (each byte is a round trip).
- UART mode: at most 17.5 KB/s for GEN (175000 baud at 10 bits per 8N1 byte) and ~8.5 KB/s for CRYPT.

Functionally validated on the Tiny Tapeout FPGA breakout (iCE40 UP5K) over both interfaces: single- and multi-block GEN, CRYPT, decrypt round-trip, and command-error handling, checked against the reference model.

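The core and UART ceilings quoted above follow from simple arithmetic, reproduced here as a small Python sketch. The constant names are illustrative. The 10 bits per byte figure assumes standard 8N1 framing (one start bit, eight data bits, one stop bit), and the results are upper bounds, not measurements.

```python
CLOCK_HZ = 35_000_000      # target clock
BLOCK_BYTES = 64           # one ChaCha20 block
CYCLES_PER_BLOCK = 84      # approximate core latency per block
UART_DIVISOR = 200         # baud = clock / 200
UART_BITS_PER_BYTE = 10    # 8N1: start + 8 data + stop

bytes_per_cycle = BLOCK_BYTES / CYCLES_PER_BLOCK
core_ceiling = bytes_per_cycle * CLOCK_HZ        # bytes per second
baud = CLOCK_HZ // UART_DIVISOR
uart_ceiling = baud / UART_BITS_PER_BYTE         # bytes per second, one way

print(f"core: {bytes_per_cycle:.2f} bytes/cycle, {core_ceiling / 1e6:.1f} MB/s")
print(f"UART: {baud} baud, at most {uart_ceiling / 1e3:.1f} KB/s one way")
```

Changing `CLOCK_HZ` shows how both ceilings scale if the chip is run at a different clock.
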
Every command is a single command byte, optionally followed by a fixed-size payload. After a command completes, BUSY returns low; wait for that before sending the next command.
| Command | Byte | Payload | Effect |
|---|---|---|---|
| LOAD_KEY | 0x01 | 32 bytes | Load the 256-bit key. |
| LOAD_NONCE | 0x02 | 12 bytes | Load the 96-bit nonce. |
| LOAD_CTR | 0x03 | 4 bytes (little-endian) | Load the 32-bit block counter. |
| GEN | 0x04 | 1 byte N | Emit N × 64 keystream bytes. |
| CRYPT | 0x05 | 2 bytes length L (little-endian) + data | For each of L data bytes in, return one XORed byte. |

Key and nonce bytes are sent in natural order (byte 0 first); they map directly
onto the RFC 8439 little-endian state layout. The block counter advances
automatically across multiple 64-byte blocks within one GEN or CRYPT.
For CRYPT, the data phase is interleaved: send one plaintext byte, read one
ciphertext byte, repeat for all L bytes. Decryption is the same command run on
the ciphertext (and the same key/nonce/counter).
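The framing described above can be sketched as a handful of pure Python helpers that build the bytes a host would send. The function and constant names are illustrative and not part of the project. The range checks only reflect the payload widths in the table; how the chip treats N = 0 or L = 0 is not stated here, so the sketch does not assume anything about it.

```python
import struct

CMD_LOAD_KEY = 0x01
CMD_LOAD_NONCE = 0x02
CMD_LOAD_CTR = 0x03
CMD_GEN = 0x04
CMD_CRYPT = 0x05


def frame_load_key(key):
    """Command byte followed by the 32 key bytes, byte 0 first."""
    if len(key) != 32:
        raise ValueError("key must be 32 bytes")
    return bytes([CMD_LOAD_KEY]) + bytes(key)


def frame_load_nonce(nonce):
    """Command byte followed by the 12 nonce bytes, byte 0 first."""
    if len(nonce) != 12:
        raise ValueError("nonce must be 12 bytes")
    return bytes([CMD_LOAD_NONCE]) + bytes(nonce)


def frame_load_ctr(counter):
    """Command byte followed by the 32-bit counter, little-endian."""
    return bytes([CMD_LOAD_CTR]) + struct.pack("<I", counter)


def frame_gen(blocks):
    """Command byte followed by the one-byte block count N."""
    if not 0 <= blocks <= 0xFF:
        raise ValueError("block count must fit in one byte")
    return bytes([CMD_GEN, blocks])


def frame_crypt_header(length):
    """Command byte followed by the 16-bit length L, little-endian.

    The L data bytes are sent afterwards, interleaved with the replies.
    """
    if not 0 <= length <= 0xFFFF:
        raise ValueError("length must fit in two bytes")
    return bytes([CMD_CRYPT]) + struct.pack("<H", length)
```

For example, `frame_load_ctr(1)` yields the command byte 0x03 followed by 01 00 00 00, which is the little-endian encoding the table calls for.
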
Reset the chip by holding rst_n low for at least a few clock cycles, then
releasing it. Pick an interface with the MODE pin and talk to it with the command
protocol above.
A minimal GEN run (using a known key/nonce/counter) is:

1. LOAD_KEY: send 0x01 then the 32 key bytes.
2. LOAD_NONCE: send 0x02 then the 12 nonce bytes.
3. LOAD_CTR: send 0x03 then the 4 counter bytes (little-endian).
4. GEN: send 0x04 then 0x01 to request one block.

The output matches the ChaCha20 keystream for that key/nonce/counter (see the RFC 8439 test vectors, or test/chacha20_ref.py in the repository, which is the bit-exact reference the test suite checks against). To encrypt, use CRYPT (0x05, the 2-byte length, then the data); to decrypt, run CRYPT again on the ciphertext.

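The GEN and CRYPT sequences above might look like this on the host side. This is a sketch under stated assumptions: `port` is any object with pyserial-style `write(data)` and `read(count)` methods, and the helper names are hypothetical. The sketch does not observe the BUSY pin. It assumes each LOAD command has completed by the time the next command byte arrives, which should be verified on real hardware.

```python
def read_exact(port, count):
    """Read exactly `count` bytes, or raise if the port stops delivering."""
    buf = bytearray()
    while len(buf) < count:
        chunk = port.read(count - len(buf))
        if not chunk:
            raise TimeoutError(f"got {len(buf)} of {count} bytes")
        buf += chunk
    return bytes(buf)


def gen_keystream(port, key, nonce, counter, blocks=1):
    """Load key/nonce/counter, then request `blocks` keystream blocks."""
    if len(key) != 32 or len(nonce) != 12 or not 0 <= blocks <= 0xFF:
        raise ValueError("bad key, nonce, or block count")
    port.write(b"\x01" + bytes(key))                     # LOAD_KEY
    port.write(b"\x02" + bytes(nonce))                   # LOAD_NONCE
    port.write(b"\x03" + counter.to_bytes(4, "little"))  # LOAD_CTR
    port.write(bytes([0x04, blocks]))                    # GEN, N blocks
    return read_exact(port, 64 * blocks)


def crypt(port, data):
    """Run CRYPT on `data` using the key/nonce/counter already loaded.

    The data phase is interleaved: one byte out, one byte back.
    """
    if len(data) > 0xFFFF:
        raise ValueError("CRYPT length is limited to 16 bits")
    port.write(b"\x05" + len(data).to_bytes(2, "little"))
    out = bytearray()
    for byte in data:
        port.write(bytes([byte]))
        out += read_exact(port, 1)
    return bytes(out)
```

With pyserial installed, `port` could be a `serial.Serial` object opened at the UART baud rate with a read timeout. The device path depends on the host, and 175000 is a non-standard rate that the host's serial driver must support.
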
UART mode (MODE = 0) is the default interface. It uses 8 data bits, no parity, and 1 stop bit; baud = clock / 200 (175000 baud at the 35 MHz default clock). On the Tiny Tapeout demo board this connects to the RP2040 USB-serial bridge, so a PC can drive it directly.
- RX: ui[3] (host → chip)
- TX: uo[4] (chip → host)
- BUSY = uo[0], ERR = uo[1]

Parallel mode (MODE = 1) is a faster byte-at-a-time interface for a host that shares the chip's clock.

- Data in: ui[7:0]; pulse WR (uio[0]) high for one cycle to write a byte. Gaps between bytes are fine: only WR-high cycles capture data.
- Data out: uo[7:0]; read it while VALID (uio[1]) is high.
- BUSY = uio[2], ERR = uio[6].
- HOLD_SEL (uio[5:4]) sets how long each output byte is held: HOLD_SEL + 1 clock cycles (1–4). Use a longer hold to give a latency-bound reader and the output pad more time to settle. The GF180 output pad's maximum toggle rate is not yet characterised, so raise HOLD_SEL on real silicon if a fast reader misses bytes.

Host requirements in parallel mode: drive MODE high before operating; pulse WR for exactly one cycle per byte; and wait for BUSY low between commands. The CRYPT data phase can be streamed back to back (one plaintext byte in, one ciphertext byte out) with no pacing at the 64-byte block boundaries: the controller holds a byte that arrives while it is recomputing the next keystream block.

No external hardware is required. In UART mode the design is driven over the Tiny Tapeout demo board's RP2040 USB-serial bridge from a PC. Parallel mode is optional and is intended for a clock-synchronous host such as an RP2040 PIO program or an FPGA (for example via the Tiny Tapeout FPGA dev board or a PMOD).

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | PDIN0 | BUSY / PDOUT0 | PAR_WR (in) |
| 1 | PDIN1 | ERR / PDOUT1 | PAR_VALID (out) |
| 2 | PDIN2 | PDOUT2 | PAR_BUSY (out) |
| 3 | RX / PDIN3 | PDOUT3 | MODE (in: 1=parallel) |
| 4 | PDIN4 | TX / PDOUT4 | HOLD_SEL0 (in) |
| 5 | PDIN5 | PDOUT5 | HOLD_SEL1 (in) |
| 6 | PDIN6 | PDOUT6 | PAR_ERR (out) |
| 7 | PDIN7 | PDOUT7 | |
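
For a host that drives the bidirectional pins directly, the control bits in the pinout table can be packed and unpacked as in the following Python sketch. The constant and function names are illustrative. The sketch only models bit positions as listed in the table and says nothing about timing.

```python
UIO_WR = 1 << 0        # PAR_WR (in)
UIO_VALID = 1 << 1     # PAR_VALID (out)
UIO_BUSY = 1 << 2      # PAR_BUSY (out)
UIO_MODE = 1 << 3      # MODE (in: 1 = parallel)
UIO_HOLD_SHIFT = 4     # HOLD_SEL occupies uio[5:4] (in)
UIO_ERR = 1 << 6       # PAR_ERR (out)


def uio_inputs(parallel, hold_sel=1, wr=False):
    """Pack the host-driven uio bits into one integer."""
    if not 0 <= hold_sel <= 3:
        raise ValueError("HOLD_SEL is a 2-bit field")
    value = hold_sel << UIO_HOLD_SHIFT
    if parallel:
        value |= UIO_MODE
    if wr:
        value |= UIO_WR
    return value


def decode_uio_outputs(uio):
    """Extract the chip-driven parallel-mode status bits."""
    return {
        "valid": bool(uio & UIO_VALID),
        "busy": bool(uio & UIO_BUSY),
        "err": bool(uio & UIO_ERR),
    }


def hold_cycles(hold_sel):
    """Clock cycles each output byte is held: HOLD_SEL + 1."""
    if not 0 <= hold_sel <= 3:
        raise ValueError("HOLD_SEL is a 2-bit field")
    return hold_sel + 1
```

For instance, `uio_inputs(True, hold_sel=1)` sets MODE and HOLD_SEL0 only, which corresponds to parallel mode with the default two-cycle hold.
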