39 ChaCha20

How it works

This is a hardware implementation of the ChaCha20 stream cipher as specified in RFC 8439: a 256-bit key, a 96-bit nonce, and a 32-bit block counter. ChaCha20 produces a keystream that is XORed with the data, so encryption and decryption are the same operation.

The design is a command-driven peripheral. A host loads the key, nonce, and counter, then issues one of two operations:

  • GEN: emit raw keystream bytes.
  • CRYPT: XOR a stream of data bytes with the keystream (encrypt or decrypt).

It has three blocks:

  • chacha20_core computes the ChaCha20 block function. It holds the 16-word (512-bit) state, runs the 20 rounds through four parallel quarter-round units (one Add-Rotate-XOR step per clock), then adds the original state back in to form the keystream block. The controller reads the result one 32-bit word at a time, so no wide keystream bus is materialised.
  • chacha20_controller is the command FSM. It decodes the command byte, collects the payload, drives the core, and streams keystream/ciphertext bytes back out. It speaks a transport-agnostic byte interface (one byte in with a valid strobe, one byte out with a busy/send handshake).
  • A host front-end, selected at runtime by the MODE pin (uio[3]):
    • MODE = 0: a UART (8N1) at baud = clock / 200.
    • MODE = 1: a synchronous parallel byte bus (one byte per clock).

Both front-ends present the identical byte interface to the controller, so the core and protocol are the same regardless of which one is used.
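
For orientation, the block function that chacha20_core computes can be sketched in software. This is a minimal Python model written from RFC 8439, not the repository's test/chacha20_ref.py; the function names are illustrative. The four quarter rounds inside each column or diagonal round touch disjoint state words, which is presumably what lets the hardware run four quarter-round units in parallel.

```python
import struct

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))

def quarter_round(s, a, b, c, d):
    """RFC 8439 quarter round on state list s, in place (four ARX steps)."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)

def chacha20_block(key, counter, nonce):
    """Return one 64-byte keystream block for a 32-byte key and 12-byte nonce."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("key must be 32 bytes and nonce 12 bytes")
    init = (
        list(struct.unpack("<4I", b"expand 32-byte k"))  # constants
        + list(struct.unpack("<8I", key))                # words 4..11
        + [counter & MASK32]                             # word 12
        + list(struct.unpack("<3I", nonce))              # words 13..15
    )
    s = list(init)
    for _ in range(10):  # 10 double rounds = 20 rounds
        # Column round: four independent quarter rounds.
        quarter_round(s, 0, 4, 8, 12)
        quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14)
        quarter_round(s, 3, 7, 11, 15)
        # Diagonal round: again four independent quarter rounds.
        quarter_round(s, 0, 5, 10, 15)
        quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13)
        quarter_round(s, 3, 4, 9, 14)
    # Final step: add the original state back in, then serialise little-endian.
    out = [(x + y) & MASK32 for x, y in zip(s, init)]
    return struct.pack("<16I", *out)
```

Calling chacha20_block(bytes(range(32)), 1, bytes.fromhex("000000090000004a00000000")) should reproduce the example block in RFC 8439 section 2.3.2.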

Performance

At the 35 MHz target clock, the core computes a 64-byte block in about 84 cycles (~0.76 bytes/cycle), a theoretical ceiling of ~27 MB/s. Deliverable throughput is set by the host interface:

  • Parallel (MODE = 1): roughly 8 MB/s for GEN at HOLD_SEL = 0 and roughly 6.5 MB/s at the default HOLD_SEL = 1 (both estimates). The workable HOLD_SEL, and hence the real rate, depends on output-pad settling and how fast the host reads. CRYPT is slower because each byte is a round trip.
  • UART (MODE = 0): 175000 baud, giving ~17.5 KB/s for GEN and ~8.5 KB/s for CRYPT.
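
The core ceiling and the UART GEN rate follow from simple arithmetic, reproduced here as a short Python check. The only assumption beyond the figures above is that an 8N1 frame costs 10 bit times per byte (start bit, 8 data bits, stop bit). The parallel-mode and CRYPT figures depend on controller and host timing not described here, so they are not derived.

```python
CLOCK_HZ = 35_000_000      # target clock
BLOCK_BYTES = 64           # one ChaCha20 keystream block
CYCLES_PER_BLOCK = 84      # approximate core latency per block

# Core ceiling: bytes per cycle, then bytes per second at the target clock.
core_bytes_per_cycle = BLOCK_BYTES / CYCLES_PER_BLOCK
core_ceiling = core_bytes_per_cycle * CLOCK_HZ

# UART: baud = clock / 200; 8N1 spends 10 bit times on each byte.
baud = CLOCK_HZ // 200
BITS_PER_UART_BYTE = 10
uart_gen_bytes_per_s = baud / BITS_PER_UART_BYTE

print(f"core: {core_bytes_per_cycle:.2f} bytes/cycle, {core_ceiling / 1e6:.1f} MB/s")
print(f"uart: {baud} baud, {uart_gen_bytes_per_s / 1e3:.1f} KB/s for GEN")
```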

Functionally validated on the Tiny Tapeout FPGA breakout (iCE40 UP5K) over both interfaces: single- and multi-block GEN, CRYPT, decrypt round-trip, and command-error handling, checked against the reference model.

Command protocol

Every command starts with a single command byte, followed by a fixed-size payload where the command has one (CRYPT additionally carries its L data bytes). BUSY returns low when a command completes; wait for that before sending the next command.

Command     Byte  Payload                                    Effect
LOAD_KEY    0x01  32 bytes                                   Load the 256-bit key.
LOAD_NONCE  0x02  12 bytes                                   Load the 96-bit nonce.
LOAD_CTR    0x03  4 bytes (little-endian)                    Load the 32-bit block counter.
GEN         0x04  1 byte N                                   Emit N × 64 keystream bytes.
CRYPT       0x05  2 bytes length L (little-endian) + data    For each of L data bytes in, return one XORed byte.

Key and nonce bytes are sent in natural order (byte 0 first); they map directly onto the RFC 8439 little-endian state layout. The block counter advances automatically across multiple 64-byte blocks within one GEN or CRYPT.
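
As an illustration of the framing, the commands can be built on the host with a few Python helpers. The helper names are hypothetical, and the range check on N is an assumption: the behaviour for N = 0 is not described here, so the sketch rejects it.

```python
import struct

CMD_LOAD_KEY = 0x01
CMD_LOAD_NONCE = 0x02
CMD_LOAD_CTR = 0x03
CMD_GEN = 0x04
CMD_CRYPT = 0x05

def load_key(key: bytes) -> bytes:
    """Command byte followed by the 32 key bytes, byte 0 first."""
    if len(key) != 32:
        raise ValueError("key must be exactly 32 bytes")
    return bytes([CMD_LOAD_KEY]) + key

def load_nonce(nonce: bytes) -> bytes:
    """Command byte followed by the 12 nonce bytes, byte 0 first."""
    if len(nonce) != 12:
        raise ValueError("nonce must be exactly 12 bytes")
    return bytes([CMD_LOAD_NONCE]) + nonce

def load_ctr(counter: int) -> bytes:
    """Command byte followed by the 32-bit counter, little-endian."""
    return bytes([CMD_LOAD_CTR]) + struct.pack("<I", counter)

def gen(n_blocks: int) -> bytes:
    """Command byte followed by the block count N (one byte)."""
    if not 1 <= n_blocks <= 255:  # assumption: N = 0 is not exercised
        raise ValueError("n_blocks must fit in one byte and be non-zero")
    return bytes([CMD_GEN, n_blocks])

def crypt_header(length: int) -> bytes:
    """Command byte followed by the 16-bit length L, little-endian."""
    return bytes([CMD_CRYPT]) + struct.pack("<H", length)
```

For example, load_ctr(1) yields the bytes 03 01 00 00 00, and crypt_header(0x0102) yields 05 02 01.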

For CRYPT, the data phase is interleaved: send one plaintext byte, read one ciphertext byte, repeat for all L bytes. Decryption is the same command run on the ciphertext (and the same key/nonce/counter).
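
The interleaved data phase can be sketched as a host-side loop. The `port` argument is an assumption: any object with `write(bytes)` and `read(n)` methods, such as a serial port wrapper. `FakePort` is a stand-in that XORs with a dummy byte pattern rather than real ChaCha20 keystream, and it restarts that pattern on every CRYPT to mimic re-loading the same key/nonce/counter. It exists only so the loop can be run without hardware.

```python
import struct

def crypt(port, data: bytes) -> bytes:
    """Send CRYPT, then exchange one byte out / one byte back for each data byte."""
    port.write(bytes([0x05]) + struct.pack("<H", len(data)))
    out = bytearray()
    for b in data:
        port.write(bytes([b]))
        reply = port.read(1)
        if len(reply) != 1:
            raise TimeoutError("no reply byte from the device")
        out += reply
    return bytes(out)

class FakePort:
    """Hypothetical stand-in for the chip; the keystream here is NOT ChaCha20."""

    def __init__(self):
        self.header_left = 0        # length bytes still expected
        self.data_left = 0          # data bytes still expected
        self.pos = 0                # position in the dummy keystream
        self.len_bytes = bytearray()
        self.rx = bytearray()       # bytes waiting to be read by the host

    def write(self, payload: bytes) -> None:
        for b in payload:
            if self.header_left == 0 and self.data_left == 0:
                if b != 0x05:
                    raise ValueError("this fake only understands CRYPT")
                self.header_left = 2
                self.len_bytes.clear()
                self.pos = 0
            elif self.header_left:
                self.len_bytes.append(b)
                self.header_left -= 1
                if self.header_left == 0:
                    self.data_left = struct.unpack("<H", self.len_bytes)[0]
            else:
                self.rx.append(b ^ ((self.pos * 37 + 11) & 0xFF))
                self.pos += 1
                self.data_left -= 1

    def read(self, n: int) -> bytes:
        out = bytes(self.rx[:n])
        del self.rx[:n]
        return out

port = FakePort()
ciphertext = crypt(port, b"hello")
recovered = crypt(port, ciphertext)  # same operation decrypts
```

Because the keystream is simply XORed in, running `crypt` on the ciphertext with the same keystream returns the original bytes.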

Status outputs

  • BUSY: high while the controller is not idle.
  • ERR: latches high if an unrecognised command byte is received; clears on reset.

How to test

Reset the chip by holding rst_n low for at least a few clock cycles, then releasing it. Pick an interface with the MODE pin and talk to it with the command protocol above.

A minimal GEN run (using a known key/nonce/counter) is:

  1. LOAD_KEY: send 0x01 then the 32 key bytes.
  2. LOAD_NONCE: send 0x02 then the 12 nonce bytes.
  3. LOAD_CTR: send 0x03 then the 4 counter bytes (little-endian).
  4. GEN: send 0x04 then 0x01 to request one block.
  5. Read the 64 keystream bytes that stream back.

The output matches the ChaCha20 keystream for that key/nonce/counter (see the RFC 8439 test vectors, or test/chacha20_ref.py in the repository, which is the bit-exact reference the test suite checks against). To encrypt, use CRYPT (0x05, the 2-byte length, then the data); to decrypt, run CRYPT again on the ciphertext.
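
The five GEN steps can be scripted as below. This is a sketch under stated assumptions: `port` is any object with `write(bytes)` and `read(n)` (for UART mode, a pyserial `Serial` object opened at 175000 baud would fit this shape), and `wait_idle` is a hypothetical hook for however the host observes BUSY going low. `RecordingPort` only records what was sent and returns zero bytes, so the script can be exercised without a chip.

```python
import struct

def run_gen(port, key, nonce, counter, n_blocks=1, wait_idle=lambda: None):
    """Load key, nonce and counter, issue GEN, and return n_blocks * 64 bytes."""
    setup_frames = (
        bytes([0x01]) + key,                         # 1. LOAD_KEY
        bytes([0x02]) + nonce,                       # 2. LOAD_NONCE
        bytes([0x03]) + struct.pack("<I", counter),  # 3. LOAD_CTR, little-endian
    )
    for frame in setup_frames:
        port.write(frame)
        wait_idle()  # the protocol asks for BUSY low between commands
    port.write(bytes([0x04, n_blocks]))              # 4. GEN, N blocks
    want = 64 * n_blocks
    out = bytearray()
    while len(out) < want:                           # 5. read the keystream
        chunk = port.read(want - len(out))
        if not chunk:
            raise TimeoutError("device stopped sending keystream")
        out += chunk
    return bytes(out)

class RecordingPort:
    """Hypothetical stand-in: records writes, answers reads with zero bytes."""

    def __init__(self):
        self.sent = bytearray()

    def write(self, data: bytes) -> None:
        self.sent += data

    def read(self, n: int) -> bytes:
        return bytes(n)

port = RecordingPort()
keystream = run_gen(port, bytes(range(32)), bytes(12), counter=1)
```

On real hardware the returned bytes would then be compared against a reference ChaCha20 implementation for the same key, nonce, and counter.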

UART mode (MODE = 0)

The default interface. 8 data bits, no parity, 1 stop bit; baud = clock / 200 (175000 baud at the 35 MHz default clock). On the Tiny Tapeout demo board this connects to the RP2040 USB-serial bridge, so a PC can drive it directly.

  • RX = ui[3] (host → chip)
  • TX = uo[4] (chip → host)
  • BUSY = uo[0], ERR = uo[1]

Parallel mode (MODE = 1)

A faster byte-at-a-time interface for a host that shares the chip's clock.

  • Data in = ui[7:0]; pulse WR (uio[0]) high for one cycle to write a byte. Gaps between bytes are fine: only WR-high cycles capture data.
  • Data out = uo[7:0]; read it while VALID (uio[1]) is high.
  • BUSY = uio[2], ERR = uio[6].
  • HOLD_SEL (uio[5:4]) sets how long each output byte is held: HOLD_SEL + 1 clock cycles (1–4). Use a longer hold to give a latency-bound reader and the output pad more time to settle. The GF180 output pad's maximum toggle rate is not yet characterised, so raise HOLD_SEL on real silicon if a fast reader misses bytes.

Host requirements in parallel mode: drive MODE high before operating; pulse WR for exactly one cycle per byte; and wait for BUSY low between commands. The CRYPT data phase can be streamed back to back (one plaintext byte in, one ciphertext byte out) with no pacing at the 64-byte block boundaries: the controller holds a byte that arrives while it is computing the next keystream block.
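
For a host that drives the bidirectional pins as one 8-bit value, the parallel-mode bit assignments above can be captured in a small helper. The function names are illustrative; the bit positions are the ones listed in this section (WR = uio[0], VALID = uio[1], BUSY = uio[2], MODE = uio[3], HOLD_SEL = uio[5:4], ERR = uio[6]).

```python
WR_BIT = 0
VALID_BIT = 1
BUSY_BIT = 2
MODE_BIT = 3
HOLD_SEL_SHIFT = 4
ERR_BIT = 6

def uio_inputs(wr: bool, hold_sel: int, mode: int = 1) -> int:
    """Pack the host-driven uio bits: WR, MODE and HOLD_SEL."""
    if not 0 <= hold_sel <= 3:
        raise ValueError("HOLD_SEL is a 2-bit field")
    return (int(wr) << WR_BIT) | (mode << MODE_BIT) | (hold_sel << HOLD_SEL_SHIFT)

def decode_status(uio: int) -> dict:
    """Extract the chip-driven status bits from a sampled uio value."""
    return {
        "valid": bool((uio >> VALID_BIT) & 1),
        "busy": bool((uio >> BUSY_BIT) & 1),
        "err": bool((uio >> ERR_BIT) & 1),
    }

def hold_cycles(hold_sel: int) -> int:
    """Each output byte is held for HOLD_SEL + 1 clock cycles."""
    return hold_sel + 1
```

For instance, uio_inputs(True, 1) gives 0x19: WR high, MODE = 1, and the default HOLD_SEL = 1, for which hold_cycles returns 2.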

External hardware

None required. In UART mode the design is driven over the Tiny Tapeout demo board's RP2040 USB-serial bridge from a PC. Parallel mode is optional and is intended for a clock-synchronous host such as an RP2040 PIO program or an FPGA (for example via the Tiny Tapeout FPGA dev board or a PMOD).

IO

#  Input       Output         Bidirectional
0  PDIN0       BUSY / PDOUT0  PAR_WR (in)
1  PDIN1       ERR / PDOUT1   PAR_VALID (out)
2  PDIN2       PDOUT2         PAR_BUSY (out)
3  RX / PDIN3  PDOUT3         MODE (in: 1=parallel)
4  PDIN4       TX / PDOUT4    HOLD_SEL0 (in)
5  PDIN5       PDOUT5         HOLD_SEL1 (in)
6  PDIN6       PDOUT6         PAR_ERR (out)
7  PDIN7       PDOUT7
