903 Hardware UTF Encoder/Decoder

903 : Hardware UTF Encoder/Decoder

Design render

How it works

This project contains hardware logic to convert between the UTF‑8, UTF‑16, and UTF‑32 encodings for Unicode text.

It will detect and raise an error signal on overlong encodings, out of range code point values, and invalid byte sequences.

(You can optionally disable range checking if you wish to use the original UTF‑8 spec that supports values up to 0x7FFFFFFF.)

Basic operation

  • In the initial state, all dedicated inputs should be set HIGH.
  • At any time, set /RESET (rst_n) LOW and pulse CLK to reset all inputs and outputs to initial state.
  • At any time, set /ROUT (input 0) LOW and pulse CLK to seek to the beginning of the output.
  • You can set ERRS or /PROPS (input 1) HIGH to get an error status on the dedicated outputs.
  • You can set ERRS or /PROPS (input 1) LOW to get character properties on the dedicated outputs.
  • You can set CHK (input 2) HIGH to raise an error signal when the code point value is out of range (≥0x110000).
  • You can set CHK (input 2) LOW to ignore out of range code point values and encode/decode values up to 0x7FFFFFFF.
  • You can set CBE (input 3) HIGH to specify big endian order for UTF‑32 and UTF‑16 input and output.
  • You can set CBE (input 3) LOW to specify little endian order for UTF‑32 and UTF‑16 input and output.

Inputting UTF‑32

  1. Set READ or /WRITE (input 4) LOW.
  2. Set /CIO (input 5, character I/O) LOW.
  3. Set bidirectional I/O to the first byte of the UTF‑32 word and pulse CLK.
  4. Set bidirectional I/O to the second byte of the UTF‑32 word and pulse CLK.
  5. Set bidirectional I/O to the third byte of the UTF‑32 word and pulse CLK.
  6. Set bidirectional I/O to the fourth byte of the UTF‑32 word and pulse CLK.
  7. Set /CIO (input 5, character I/O) HIGH.
  8. Set READ or /WRITE (input 4) HIGH.
  9. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
  10. If READY (output 0) is LOW or ERROR (output 5) is HIGH, the input was out of range (≥0x110000 or, if CHK is LOW, ≥0x80000000).

Inputting UTF‑16

  1. Set ERRS or /PROPS (input 1) LOW.
  2. Set READ or /WRITE (input 4) LOW.
  3. Set /UIO (input 6, UTF‑16 I/O) LOW.
  4. Set bidirectional I/O to the first byte of the first UTF‑16 word and pulse CLK.
  5. Set bidirectional I/O to the second byte of the first UTF‑16 word and pulse CLK.
  6. If HIGHCHAR (output 3) is LOW, skip to step 9.
  7. Set bidirectional I/O to the first byte of the second UTF‑16 word and pulse CLK.
  8. Set bidirectional I/O to the second byte of the second UTF‑16 word and pulse CLK.
  9. Set /UIO (input 6, UTF‑16 I/O) HIGH.
  10. Set READ or /WRITE (input 4) HIGH.
  11. Set ERRS or /PROPS (input 1) HIGH.
  12. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
  13. If RETRY (output 1) is HIGH, the first word was a high surrogate but the second word was not a low surrogate. The output will be the high surrogate only; the last word will need to be processed again.

Inputting UTF‑8

  1. Set READ or /WRITE (input 4) LOW.
  2. Set /BIO (input 7, byte I/O) LOW.
  3. Set bidirectional I/O to the current byte of the UTF‑8 sequence and pulse CLK.
  4. Repeat step 3 until READY (output 0) or ERROR (output 5) is HIGH.
  5. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
  6. If RETRY (output 1) is HIGH, the UTF‑8 sequence was truncated (not enough continuation bytes). The output will be the truncated sequence only; the last byte will need to be processed again.
  7. If INVALID (output 2) is HIGH, the UTF‑8 sequence was a single continuation byte or invalid byte (0xFE or 0xFF).
  8. If OVERLONG (output 3) is HIGH, the UTF‑8 sequence was an overlong encoding.
  9. If NONUNI (output 4) is HIGH, the UTF‑8 sequence was out of range (≥0x110000).

Outputting UTF‑32

  1. Set READ or /WRITE (input 4) HIGH.
  2. Set /CIO (input 5, character I/O) LOW.
  3. Pulse CLK and read the first byte of the UTF‑32 word from the bidirectional I/O.
  4. Pulse CLK and read the second byte of the UTF‑32 word from the bidirectional I/O.
  5. Pulse CLK and read the third byte of the UTF‑32 word from the bidirectional I/O.
  6. Pulse CLK and read the fourth byte of the UTF‑32 word from the bidirectional I/O.
  7. Set /CIO (input 5, character I/O) HIGH.
  8. If the UTF‑32 word is within range, the input and output are both valid.
  9. If the UTF‑32 word is not within range, then the input was either incomplete or invalid.

Outputting UTF‑16

  1. Set READ or /WRITE (input 4) HIGH.
  2. If UEOF (output 6) is HIGH, then the input was either incomplete or invalid.
  3. Set /UIO (input 6, UTF‑16 I/O) LOW.
  4. Pulse CLK and read the next byte of the UTF‑16 sequence from the bidirectional I/O.
  5. Repeat step 4 until UEOF (output 6) is HIGH.
  6. Set /UIO (input 6, UTF‑16 I/O) HIGH.

Outputting UTF‑8

  1. Set READ or /WRITE (input 4) HIGH.
  2. If BEOF (output 7) is HIGH, then the input was either incomplete or invalid.
  3. Set /BIO (input 7, byte I/O) LOW.
  4. Pulse CLK and read the next byte of the UTF‑8 sequence from the bidirectional I/O.
  5. Repeat step 4 until BEOF (output 7) is HIGH.
  6. Set /BIO (input 7, byte I/O) HIGH.

Error status

When ERRS or /PROPS (input 1) is HIGH, the dedicated outputs will be:

# Name Meaning
0 READY The input and output are complete sequences.
1 RETRY The previous input was invalid or the start of another sequence and was ignored. Process the output, reset, and try the previous input again.
2 INVALID The input and output are invalid.
3 OVERLONG The UTF‑8 input was an overlong sequence.
4 NONUNI The code point value is out of range (≥0x110000). (This is set independently of the CHK input; the CHK input only changes whether this counts as an error.)
5 ERROR Equivalent to (RETRY or INVALID or OVERLONG or (NONUNI and CHK)).

If all of these outputs are LOW, the accumulated input is incomplete and more input is required (underflow).

Character properties

When ERRS or /PROPS (input 1) is LOW, the dedicated outputs will be:

# Name Meaning
0 NORMAL The code point value is valid and not a C0 or C1 control character, surrogate, private use character, or noncharacter.
1 CONTROL The code point value is valid and a C0 or C1 control character (0x00-0x1F or 0x7F-0x9F).
2 SURROGATE The code point value is valid and a UTF‑16 surrogate (0xD800-0xDFFF).
3 HIGHCHAR The code point value is valid and either a high surrogate (0xD800-0xDBFF) or a non-BMP character (≥0x10000).
4 PRIVATE The code point value is valid and either a private use character (0xE000-0xF8FF, ≥0xF0000) or the high surrogate of a private use character (0xDB80-0xDBFF).
5 NONCHAR The code point value is valid and a noncharacter (0xFDD0-0xFDEF or the last two code points of any plane).

If all of these outputs are LOW, there is no valid code point in the output.

How to test

The test.py file covers a comprehensive set of test cases which are listed in a separate file to avoid bloating the TT09 manual.

External hardware

Any device that needs to process Unicode text.

IO

#InputOutputBidirectional
0/ROUTREADY; NORMALI/O LSB
1ERRS, /PROPSRETRY; CONTROLI/O
2CHKINVALID; SURROGATEI/O
3CBE, /CLEOVERLONG; HIGHCHARI/O
4READ, /WRITENONUNI; PRIVATEI/O
5/CIOERROR; NONCHARI/O
6/UIOUEOFI/O
7/BIOBEOFI/O MSB

Chip location

Controller Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux tt_um_chip_rom (Chip ROM) tt_um_factory_test (Tiny Tapeout Factory Test) tt_um_oscillating_bones (Oscillating Bones) tt_um_rebelmike_incrementer (Incrementer) tt_um_rebeccargb_tt09ball_gdsart (TT09Ball GDS Art) tt_um_tt_tinyQV (TinyQV 'Asteroids' - Crowdsourced Risc-V SoC) tt_um_DalinEM_asic_1 (ASIC) tt_um_urish_simon (Simon Says memory game) tt_um_rburt16_bias_generator (Bias Generator) tt_um_librelane3_test (Tiny Tapeout LibreLane 3 Test) tt_um_10_vga_crossyroad (Crossyroad) tt_um_rebeccargb_universal_decoder (Universal Binary to Segment Decoder) tt_um_rebeccargb_hardware_utf8 (Hardware UTF Encoder/Decoder) tt_um_rebeccargb_intercal_alu (INTERCAL ALU) tt_um_rebeccargb_dipped (Densely Packed Decimal) tt_um_rebeccargb_styler (Styler) tt_um_rebeccargb_vga_timing_experiments (VGA Timing Experiments) tt_um_rebeccargb_colorbars (Color Bars) tt_um_rebeccargb_vga_pride (VGA Pride) tt_um_cw_vref (Current-Mode Bandgap Reference) tt_um_tinytapeout_logo_screensaver (VGA Screensaver with Tiny Tapeout Logo) tt_um_rburt16_opamp_3stage (OpAmp 3stage) tt_um_gamepad_pmod_demo (Gamepad Pmod Demo) tt_um_micro_tiles_container (Micro tile container) tt_um_virantha_enigma (Enigma - 52-bit Key Length) tt_um_jamesrosssharp_1bitam (1bit_am_sdr) tt_um_jamesrosssharp_tiny1bitam (Tiny 1-bit AM Radio) tt_um_MichaelBell_rle_vga (RLE Video Player) tt_um_MichaelBell_mandelbrot (VGA Mandelbrot) tt_um_murmann_group (Decimation Filter for Incremental and Regular Delta-Sigma Modulators) tt_um_betz_morse_keyer (Morse Code Keyer) tt_um_urish_giant_ringosc (Giant Ring Oscillator (3853 inverters)) tt_um_tiny_pll (Tiny PLL) tt_um_tc503_countdown_timer (Countdown Timer) tt_um_richardgonzalez_ped_traff_light (Pedestrian Traffic Light) tt_um_analog_factory_test (TT08 Analog Factory Test) tt_um_alexandercoabad_mixedsignal (mixedsignal) tt_um_tgrillz_sixSidedDie (Six Sided Die) tt_um_mattvenn_analog_ring_osc (Ring Oscillators) tt_um_vga_clock (VGA clock) tt_um_mattvenn_r2r_dac_3v3 (Analog 8 bit 3.3v R2R DAC) tt_um_mattvenn_spi_test (SPI test) tt_um_quarren42_demoscene_top (asic design is my passion) tt_um_micro_tiles_container_group2 (Micro tile container (group 2)) tt_um_z2a_rgb_mixer (RGB Mixer demo) tt_um_frequency_counter (Frequency counter) tt_um_urish_sic1 (SIC-1 8-bit SUBLEQ Single Instruction Computer) tt_um_tobi_mckellar_top (Capacitive Touch Sensor) tt_um_log_afpm (16-bit Logarithmic Approximate Floating Point Multiplier) tt_um_uwasic_dinogame (UW ASIC - Optimized Dino) tt_um_ece298a_8_bit_cpu_top (8-Bit CPU) tt_um_tqv_peripheral_harness (Rotary Encoder Peripheral) tt_um_led_matrix_driver (SPI LED Matrix Driver) tt_um_2048_vga_game (2048 sliding tile puzzle game (VGA)) tt_um_mac (MAC) tt_um_dpmunit (DPM_Unit) tt_um_nitelich_riscyjr (RISCY Jr.) tt_um_nitelich_conway (Conway's GoL) tt_um_pwen (Pulse Width Encoder) tt_um_mcs4_cpu (MCS-4 4004 CPU) tt_um_mbist (Design of SRAM BIST) tt_um_weighted_majority (Weighted Majority Voter / Trend Detector) tt_um_brandonramos_VGA_Pong_with_NES_Controllers (VGA Pong with NES Controllers) tt_um_brandonramos_opamp_ladder (2-bit Flash ADC) tt_um_NE567Mixer28 (OTA folded cascode) tt_um_acidonitroso_programmable_threshold_voltage_sensor (Programmable threshold voltage sensor) tt_um_DAC1 (tt_um_DAC1) tt_um_trivium_stream_processor (Trivium Stream Cipher) tt_um_analog_example (Digital OTA) tt_um_sortaALUAriaMitra (Sorta 4-Bit ALU) tt_um_RoyTr16 (Connect Four VGA) tt_um_jnw_wulffern (JNW-TEMP) tt_um_serdes (Secure SERDES with Integrated FIR Filtering) tt_um_limpix31_r0 (VGA Human Reaction Meter) tt_um_torurstrom_async_lock (Asynchronous Locking Unit) tt_um_galaguna_PostSys (Post's Machine CPU Based) tt_um_edwintorok (Rounding error) tt_um_td4 (tt-td04) tt_um_snn (Reward implemented Spiking Neural Network) tt_um_matrag_chirp_top (Tiny Tapeout Chirp Modulator) tt_um_sha256_processor_dvirdc (SHA-256 Processor) tt_um_pchri03_levenshtein (Fuzzy Search Engine) tt_um_AriaMitraClock (12 Hour Clock (with AM and PM)) tt_um_swangust (posit8_add) tt_um_DelosReyesJordan_HDL (Reaction Time Test) tt_um_upalermo_simple_analog_circuit (Simple Analog Circuit) tt_um_swangust2 (posit8_mul) tt_um_thexeno_rgbw_controller (RGBW Color Processor) tt_um_top_layer (Spike Detection and Classification System) tt_um_Alida_DutyCycleMeter (Duty Cycle Meter) tt_um_dco (Digitally Controlled Oscillator) tt_um_8bitalu (8-bit Pipelined ALU) tt_um_resfuzzy (resfuzzy) tt_um_javibajocero_top (MarcoPolo) tt_um_Scimia_oscillator_tester (Oscillator tester) tt_um_ag_priority_encoder_parity_checker (Priority Encoder with Parity Checker) tt_um_tnt_mosbius (tnt's variant of SKY130 mini-MOSbius) tt_um_program_counter_top_level (Test Design 1) tt_um_subdiduntil2_mixed_signal_classifier (Mixed-signal Classifier) tt_um_dac_test3v3 (Analog 8 bit 3.3v R2R DAC) tt_um_LPCAS_TP1 ( LPCAS_TP1 ) tt_um_regfield (Register Field) tt_um_delaychain (Delay Chain) tt_um_tdctest_container (Micro tile container) tt_um_spacewar (Spacewar) tt_um_Enhanced_pll (Enhance PLL) tt_um_romless_cordic_engine (ROM-less Cordic Engine) tt_um_ev_motor_control (PLC Based Electric Vehicle Motor Control System) tt_um_plc_prg (PLC-PRG) tt_um_kishorenetheti_tt16_mips (8-bit MIPS Single Cycle Processor) tt_um_snn_core (Adaptive Leaky Integrate-and-Fire spiking neuron core for edge AI) tt_um_myprocessor (8-bit Custom Processor) tt_um_sjsu (SJSU vga demo) tt_um_vedic_4x4 (Vedic 4x4 Multiplier) tt_um_braun_mult (8x8 Braun Array Multiplier) tt_um_r2r_dac (4-bit R2R DAC) tt_um_stochastic_integrator_tt9_CL123abc (Stochastic Integrator) tt_um_uart (UART Controller with FIFO and Interrupts) tt_um_lfsr_stevej (Linear Feedback Shift Register) tt_um_FFT_engine (FFT Engine) tt_um_tpu (Tiny Tapeout Tensor Processing Unit) tt_um_tt_tinyQVb (TinyQV 'Berzerk' - Crowdsourced Risc-V SoC) tt_um_IZ_RG_22 (IZ_RG_22) tt_um_32_bit_fp_ALU_S_M (32-bit floating point ALU) tt_um_AriaMitraGames (Games (Tic Tac Toe and Rock Paper Scissors)) tt_um_sc_bipolar_qif_neuron (Stochastic Computing based QIF model neuron) tt_um_mac_spst_tiny (Low Power and Enhanced Speed Multiplier, Accumulator with SPST Adder) tt_um_kb2ghz_xalu (4-bit minicomputer ALU) tt_um_emmersonv_tiq_adc (3 Bit TIQ ADC) tt_um_simonsays (Simon Says) tt_um_BNN (8-bit Binary Neural Network) tt_um_anweiteck_2stageCMOSOpAmp (2 Stage CMOS Op Amp) tt_um_6502 (Simplified 6502 Processor) tt_um_swangust3 (posit8_div) tt_um_jonathan_thing_vga (VGA-Video-Player) tt_um_wokwi_412635532198550529 (ttsky-pettit-wokproc-trainer) tt_um_vga_hello_world (VGA HELLO WORLD) tt_um_jyblue1001_pll (Analog PLL) tt_um_BryanKuang_mac_peripheral (8-bit Multiply-Accumulate (MAC) with 2-Cycle Serial Interface) tt_um_rebeccargb_tt09ball_screensaver (TT09Ball VGA Screensaver) tt_um_openfpga22 (Open FGPA 2x2 design) tt_um_andyshor_demux (Demux) tt_um_flash_raid_controller (SPI flash raid controller) tt_um_jonnor_pdm_microphone (PDM microphone) tt_um_digital_playground (Sky130 Digital Playground) tt_um_mod6_counter (Mod-6 Counter) tt_um_BMSCE_T2 (Choreo8) tt_um_Richard28277 (4-bit ALU) tt_um_shuangyu_top (Calculator) tt_um_wokwi_441382314812372993 (Sumador/restador de 4 bits) tt_um_TensorFlowE (TensorFlowE) tt_um_wokwi_441378095886546945 (7SDSC) tt_um_wokwi_440004235377529857 (Latched 4-bits adder) tt_um_dlmiles_tqvph_i2c (TinyQV I2C Controller Device) tt_um_markgarnold_pdp8 (Serial PDP8) tt_um_wokwi_441564414591667201 (tt-parity-detector) tt_um_vga_glyph_mode (VGA Glyph Mode) tt_um_toivoh_pwl_synth (PiecewiseOrionSynth Deluxe) tt_um_minirisc (MiniRISC-FSM) tt_um_wokwi_438920793944579073 (Multiple digital design structures) tt_um_sleepwell (Sleep Well) tt_um_lcd_controller_Andres078 (LCD_controller) tt_um_SummerTT_HDL (SJSU Summer Project: Game of Life) tt_um_chrishtet_LIF (Leaky Integrate and Fire Neuron) tt_um_diff (ttsky25_EpitaXC) tt_um_htfab_split_flops (Split Flops) tt_um_alu_4bit_wrapper (4-bit ALU with Flags) tt_um_tnt_rf_test (TTSKY25A Register File Test) tt_um_mosbius (mini mosbius) tt_um_robot_controller_top_module (AR Chip) tt_um_flummer_ltc (Linear Timecode (LTC) generator) tt_um_stress_sensor (Tiny_Tapeout_2025_three_sensors) tt_um_krisjdev_manchester_baby (Manchester Baby) tt_um_mbikovitsky_audio_player (Simple audio player) tt_um_wokwi_414123795172381697 (TinySnake) tt_um_vga_example (Jabulani Ball VGA Demo ) tt_um_stochastic_addmultiply_CL123abc (Stochastic Multiplier, Adder and Self-Multiplier) tt_um_nvious_graphics (nVious Graphics) tt_um_pe_simonbju (pe) tt_um_mikael (TinyTestOut) tt_um_brent_kung (brent-kung_4) tt_um_7FM_ShadyPong (ShadyPong) tt_um_algofoogle_vga_matrix_dac (Analog VGA CSDAC experiments) tt_um_tv_b_gone (TV-B-Gone) tt_um_sjsu_vga_music (SJSU Fight Song) tt_um_fsm_haz (FSM based RISC-V Pipeline Hazard Resolver) tt_um_dma (DMA controller) tt_um_3v_inverter_SiliconeGuide (Analog Double Inverter) tt_um_rejunity_lgn_mnist (LGN hand-written digit classifier (MNIST, 16x16 pixels)) tt_um_gray_sobel (Gray scale and Sobel filter for Edge Detection) tt_um_Xgamer1999_LIF (Demonstration of Leaky integrate and Fire neuron SJSU) tt_um_dac12 (12 bit DAC) tt_um_voting_machine (Digital Voting Machine) tt_um_updown_counter (8bit_up-down_counter) tt_um_openram_top (Single Port OpenRAM Testchip) tt_um_customalu (Custom ALU) tt_um_assaify_mssf_pll (24 MHz MSSF PLL) tt_um_Maj_opamp (2-Stage OpAmp Design) tt_um_wokwi_442131619043064833 (Encoder 7 segments display) tt_um_wokwi_441835796137492481 (TESVG Binary Counter and shif register ) tt_um_combo_haz (Combinational Logic Based RISC-V Pipeline Hazard Resolver) tt_um_tx_fsm (Design and Functional Verification of Error-Correcting FIFO Buffer with SECDED and ARQ ) tt_um_will_keen_solitaire (solitaire) tt_um_rom_vga_screensaver (VGA Screensaver with embedded bitmap ROM) tt_um_13hihi31_tdc (Time to Digital Converter) tt_um_dteal_awg (Arbitrary Waveform Generator) tt_um_LIF_neuron (AFM_LIF) tt_um_rebelmike_register (Circulating register test) tt_um_MichaelBell_hs_mul (8b10b decoder and multiplier) tt_um_SNPU (random_latch) tt_um_rejunity_atari2600 (Atari 2600) tt_um_bit_serial_cpu_top (16-bit bit-serial CPU) tt_um_semaforo (semaforo) tt_um_bleeptrack_cc1 (Cross stitch Creatures #1) tt_um_bleeptrack_cc2 (Cross stitch Creatures #2) tt_um_bleeptrack_cc3 (Cross stitch Creatures #3) tt_um_bleeptrack_cc4 (Cross stitch Creatures #4) tt_um_bitty (Bitty) tt_um_spi2ws2811x16 (spi2ws2811x8) tt_um_uart_spi (UART and SPI Communication blocks with loopback) tt_um_urish_charge_pump (Dickson Charge Pump) tt_um_adc_dac_tern_alu (adc_dac_BCT_addr_ALU_STI) tt_um_sky1 (GD Sky Processor) tt_um_fifo (ASYNCHRONOUS FIFO) tt_um_TT16 (Asynchronous FIFO) tt_um_axi4lite_top (Axi4_Lite) tt_um_TT06_pwm (PWM Generator) tt_um_hack_cpu (HACK CPU) tt_um_marxkar_jtag (JTAG CONTROLLER) tt_um_cache_controller (Simple Cache Controller) tt_um_stopwatchtop (Stopwatch with 7-seg Display) tt_um_adpll (all-digital pll) tt_um_tnt_rom_test (TT09 SKY130 ROM Test) tt_um_tnt_rom_nolvt_test (TT09 SKY130 ROM Test (no LVT variant)) tt_um_wokwi_414120207283716097 (fulladder) tt_um_kianV_rv32ima_uLinux_SoC (KianV uLinux SoC) tt_um_tv_b_gone_rom (TV-B-Gone-EU) Available Available Available Available Available Available Available Available Available Available Available Available Available