901 Hardware UTF Encoder/Decoder

This project contains hardware logic to convert between the UTF‑8, UTF‑16, and UTF‑32 encodings for Unicode text.

It will detect and raise an error signal on overlong encodings, out of range code point values, and invalid byte sequences.

(You can optionally disable range checking if you wish to use the original UTF‑8 spec that supports values up to 0x7FFFFFFF.)

In the initial state, all dedicated inputs should be set HIGH.
At any time, set /RESET (rst_n) LOW and pulse CLK to reset all inputs and outputs to their initial state.
At any time, set /ROUT (input 0) LOW and pulse CLK to seek to the beginning of the output.
You can set ERRS or /PROPS (input 1) HIGH to get an error status on the dedicated outputs.
You can set ERRS or /PROPS (input 1) LOW to get character properties on the dedicated outputs.
You can set CHK (input 2) HIGH to raise an error signal when the code point value is out of range (≥0x110000).
You can set CHK (input 2) LOW to ignore out of range code point values and encode/decode values up to 0x7FFFFFFF.
You can set CBE (input 3) HIGH to specify big endian order for UTF‑32 and UTF‑16 input and output.
You can set CBE (input 3) LOW to specify little endian order for UTF‑32 and UTF‑16 input and output.
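
When driving the chip from a test bench, the control inputs described above can be packed into a single byte. The following is a minimal sketch, assuming that dedicated input N corresponds to bit N of an 8-bit value (as the pinout numbering suggests). The helper `control_byte` and its parameter names are hypothetical and are not part of the project.

```python
# Assumption: dedicated input N is bit N of an 8-bit value, per the pinout.
ROUT_N, ERRS, CHK, CBE, READ, CIO_N, UIO_N, BIO_N = (1 << n for n in range(8))

def control_byte(errs=True, chk=True, big_endian=True, read=True,
                 rout=False, cio=False, uio=False, bio=False):
    """Pack the dedicated inputs into one byte (hypothetical helper).

    For the active-low strobes (/ROUT, /CIO, /UIO, /BIO), passing True
    asserts the signal, which drives the pin LOW and clears its bit.
    """
    value = 0
    for flag, bit in ((errs, ERRS), (chk, CHK), (big_endian, CBE), (read, READ)):
        if flag:
            value |= bit
    for asserted, bit in ((rout, ROUT_N), (cio, CIO_N), (uio, UIO_N), (bio, BIO_N)):
        if not asserted:
            value |= bit
    return value

IDLE = control_byte()                              # every input HIGH
WRITE_UTF32 = control_byte(read=False, cio=True)   # /WRITE and /CIO LOW
```

Treating the active-low strobes as "asserted" flags keeps call sites readable: the default call reproduces the initial state in which all dedicated inputs are HIGH.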

To input a UTF‑32 character:
1. Set READ or /WRITE (input 4) LOW.
2. Set /CIO (input 5, character I/O) LOW.
3. Set bidirectional I/O to the first byte of the UTF‑32 word and pulse CLK.
4. Set bidirectional I/O to the second byte of the UTF‑32 word and pulse CLK.
5. Set bidirectional I/O to the third byte of the UTF‑32 word and pulse CLK.
6. Set bidirectional I/O to the fourth byte of the UTF‑32 word and pulse CLK.
7. Set /CIO (input 5, character I/O) HIGH.
8. Set READ or /WRITE (input 4) HIGH.
9. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
10. If READY (output 0) is LOW or ERROR (output 5) is HIGH, the input was out of range (≥0x110000 or, if CHK is LOW, ≥0x80000000).
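
The byte order in which the UTF‑32 word is presented depends on CBE. The sketch below shows one way to derive the four bytes and to predict the range result on the host side. The function names are hypothetical, and the range rule is taken directly from the thresholds stated above rather than from the chip's logic.

```python
def utf32_input_bytes(code_point, big_endian=True):
    """Return the four bytes to place on the bidirectional I/O, in clocking order."""
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError("a UTF-32 word is 32 bits wide")
    order = "big" if big_endian else "little"   # CBE HIGH selects big endian
    return list(code_point.to_bytes(4, order))

def utf32_in_range(code_point, chk=True):
    """Limit is 0x110000 with CHK HIGH and 0x80000000 with CHK LOW."""
    limit = 0x110000 if chk else 0x80000000
    return 0 <= code_point < limit
```

For example, U+1F600 is clocked in as 0x00, 0x01, 0xF6, 0x00 with CBE HIGH and in the reverse order with CBE LOW.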

To input a UTF‑16 character:
1. Set ERRS or /PROPS (input 1) LOW.
2. Set READ or /WRITE (input 4) LOW.
3. Set /UIO (input 6, UTF‑16 I/O) LOW.
4. Set bidirectional I/O to the first byte of the first UTF‑16 word and pulse CLK.
5. Set bidirectional I/O to the second byte of the first UTF‑16 word and pulse CLK.
6. If HIGHCHAR (output 3) is LOW, skip to step 9.
7. Set bidirectional I/O to the first byte of the second UTF‑16 word and pulse CLK.
8. Set bidirectional I/O to the second byte of the second UTF‑16 word and pulse CLK.
9. Set /UIO (input 6, UTF‑16 I/O) HIGH.
10. Set READ or /WRITE (input 4) HIGH.
11. Set ERRS or /PROPS (input 1) HIGH.
12. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
13. If RETRY (output 1) is HIGH, the first word was a high surrogate but the second word was not a low surrogate. The output will be the high surrogate only; the last word will need to be processed again.
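
The surrogate handling described above can be modelled in software. This is a sketch of standard UTF‑16 pair decoding with a `retry` result that mirrors the RETRY case; it is not taken from the chip's implementation, and the function names are hypothetical.

```python
def is_high_surrogate(word):
    return 0xD800 <= word <= 0xDBFF

def is_low_surrogate(word):
    return 0xDC00 <= word <= 0xDFFF

def decode_utf16_words(first, second=None):
    """Return (code_point, retry) for one or two 16-bit words.

    retry is True when `first` is a high surrogate and `second` is not a
    low surrogate; the caller must then process `second` again on its own.
    """
    if not is_high_surrogate(first):
        return first, False            # single word; no second word needed
    if second is None:
        raise ValueError("a high surrogate needs a second word")
    if not is_low_surrogate(second):
        return first, True             # lone high surrogate
    code_point = 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    return code_point, False
```

The check on `first` plays the same role as testing HIGHCHAR after the first word: only a high surrogate requires a second word.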

To input a UTF‑8 character:
1. Set READ or /WRITE (input 4) LOW.
2. Set /BIO (input 7, byte I/O) LOW.
3. Set bidirectional I/O to the current byte of the UTF‑8 sequence and pulse CLK.
4. Repeat step 3 until READY (output 0) or ERROR (output 5) is HIGH.
5. If READY (output 0) is HIGH and ERROR (output 5) is LOW, the input and output are both valid.
6. If RETRY (output 1) is HIGH, the UTF‑8 sequence was truncated (not enough continuation bytes). The output will be the truncated sequence only; the last byte will need to be processed again.
7. If INVALID (output 2) is HIGH, the UTF‑8 sequence was a single continuation byte or invalid byte (0xFE or 0xFF).
8. If OVERLONG (output 3) is HIGH, the UTF‑8 sequence was an overlong encoding.
9. If NONUNI (output 4) is HIGH, the UTF‑8 sequence was out of range (≥0x110000).
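
To make the error cases above concrete, here is a software sketch of a UTF‑8 decoder that reports the same flag names. It follows the original UTF‑8 scheme (up to six bytes, values up to 0x7FFFFFFF). It is an illustrative model only: the function name is hypothetical, and setting READY alongside OVERLONG or NONUNI for a structurally complete sequence is an assumption, since the exact flag combinations produced by the hardware are not spelled out here.

```python
def decode_utf8(data, chk=True):
    """Decode one sequence from the start of `data` (a bytes object).

    Returns (code_point, consumed, flags). code_point is None when no
    complete sequence was decoded; an empty flag set means underflow.
    """
    if not data:
        return None, 0, set()
    lead = data[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif lead < 0xC0 or lead >= 0xFE:
        # Lone continuation byte, or 0xFE / 0xFF.
        return None, 1, {"INVALID", "ERROR"}
    else:
        # Count the leading one bits of the lead byte (2 to 6).
        length = 2
        while lead & (0x80 >> length):
            length += 1
        cp = lead & (0x7F >> length)
    for i in range(1, length):
        if i >= len(data):
            return None, len(data), set()          # more input required
        if data[i] & 0xC0 != 0x80:
            # Truncated: byte i starts something else and must be retried.
            return None, i, {"RETRY", "ERROR"}
        cp = (cp << 6) | (data[i] & 0x3F)
    flags = {"READY"}
    # Smallest value that legitimately needs `length` bytes.
    minimum = (0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)[length - 1]
    if cp < minimum:
        flags.add("OVERLONG")
    if cp >= 0x110000:
        flags.add("NONUNI")
    if "OVERLONG" in flags or ("NONUNI" in flags and chk):
        flags.add("ERROR")
    return cp, length, flags
```

With `chk=False`, a sequence such as F4 90 80 80 (0x110000) still reports NONUNI but no longer counts as an error, matching the described role of the CHK input.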

To output a UTF‑32 character:
1. Set READ or /WRITE (input 4) HIGH.
2. Set /CIO (input 5, character I/O) LOW.
3. Pulse CLK and read the first byte of the UTF‑32 word from the bidirectional I/O.
4. Pulse CLK and read the second byte of the UTF‑32 word from the bidirectional I/O.
5. Pulse CLK and read the third byte of the UTF‑32 word from the bidirectional I/O.
6. Pulse CLK and read the fourth byte of the UTF‑32 word from the bidirectional I/O.
7. Set /CIO (input 5, character I/O) HIGH.
8. If the UTF‑32 word is within range, the input and output are both valid.
9. If the UTF‑32 word is not within range, then the input was either incomplete or invalid.

To output a UTF‑16 character:
1. Set READ or /WRITE (input 4) HIGH.
2. If UEOF (output 6) is HIGH, then the input was either incomplete or invalid.
3. Set /UIO (input 6, UTF‑16 I/O) LOW.
4. Pulse CLK and read the next byte of the UTF‑16 sequence from the bidirectional I/O.
5. Repeat step 4 until UEOF (output 6) is HIGH.
6. Set /UIO (input 6, UTF‑16 I/O) HIGH.
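
When checking the bytes read back during UTF‑16 output, it helps to have the expected byte stream for a given code point. The sketch below computes it using standard UTF‑16 encoding, with the byte order chosen as CBE would choose it. The function name is hypothetical, and code points at or above 0x110000 are rejected because UTF‑16 cannot represent them.

```python
def utf16_output_bytes(code_point, big_endian=True):
    """Return the two or four bytes of the UTF-16 encoding of one code point."""
    if not 0 <= code_point < 0x110000:
        raise ValueError("code point cannot be represented in UTF-16")
    if code_point < 0x10000:
        words = [code_point]                       # single 16-bit word
    else:
        offset = code_point - 0x10000
        words = [0xD800 | (offset >> 10),          # high surrogate
                 0xDC00 | (offset & 0x3FF)]        # low surrogate
    order = "big" if big_endian else "little"
    return b"".join(word.to_bytes(2, order) for word in words)
```

The length of the result (2 or 4) is the number of clock pulses one would expect before UEOF goes HIGH, assuming UEOF marks the end of the encoded sequence.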

To output a UTF‑8 character:
1. Set READ or /WRITE (input 4) HIGH.
2. If BEOF (output 7) is HIGH, then the input was either incomplete or invalid.
3. Set /BIO (input 7, byte I/O) LOW.
4. Pulse CLK and read the next byte of the UTF‑8 sequence from the bidirectional I/O.
5. Repeat step 4 until BEOF (output 7) is HIGH.
6. Set /BIO (input 7, byte I/O) HIGH.
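
For comparison with the bytes read during UTF‑8 output, the sketch below encodes a code point using the original UTF‑8 scheme, which extends to six bytes for values up to 0x7FFFFFFF (relevant when CHK is LOW). The function name is hypothetical. Surrogates are encoded as ordinary three-byte sequences here, which is an assumption about what the chip does with them.

```python
def utf8_output_bytes(code_point):
    """Encode one code point as 1 to 6 bytes (original UTF-8 scheme)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point < 0x80:
        return bytes([code_point])
    # Exclusive upper bounds for 2- to 6-byte sequences.
    for length, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if code_point < limit:
            break
    lead_mark = (0xFF00 >> length) & 0xFF          # 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    out = [lead_mark | (code_point >> (6 * (length - 1)))]
    for i in range(length - 2, -1, -1):
        out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
    return bytes(out)
```

For code points below 0x110000 outside the surrogate range, the result agrees with Python's built-in `str.encode("utf-8")`.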

When ERRS or /PROPS (input 1) is HIGH, the dedicated outputs will be:

#	Name	Meaning
0	READY	The input and output are complete sequences.
1	RETRY	The previous input was invalid or the start of another sequence and was ignored. Process the output, reset, and try the previous input again.
2	INVALID	The input and output are invalid.
3	OVERLONG	The UTF‑8 input was an overlong sequence.
4	NONUNI	The code point value is out of range (≥0x110000). (This is set independently of the CHK input; the CHK input only changes whether this counts as an error.)
5	ERROR	Equivalent to (RETRY or INVALID or OVERLONG or (NONUNI and CHK)).

If all of these outputs are LOW, the accumulated input is incomplete and more input is required (underflow).
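
The status table above can be turned into a small host-side helper. This sketch assumes that dedicated output N is bit N of the byte read from the chip; the function names are hypothetical. `expected_error` restates the ERROR equation from the table so that a test bench can cross-check the ERROR pin.

```python
# Assumption: dedicated output N is bit N of the byte read from the chip.
STATUS_NAMES = ("READY", "RETRY", "INVALID", "OVERLONG", "NONUNI", "ERROR")

def decode_status(outputs):
    """Return the set of status names whose bit is set (bits 0 to 5 only)."""
    return {name for bit, name in enumerate(STATUS_NAMES) if outputs & (1 << bit)}

def expected_error(status, chk):
    """ERROR = RETRY or INVALID or OVERLONG or (NONUNI and CHK)."""
    return bool(status & {"RETRY", "INVALID", "OVERLONG"}) or ("NONUNI" in status and bool(chk))

def is_underflow(outputs):
    """All six status outputs LOW means more input is required."""
    return (outputs & 0x3F) == 0
```

Bits 6 and 7 (UEOF and BEOF) are masked out because they are not part of the error status.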

When ERRS or /PROPS (input 1) is LOW, the dedicated outputs will be:

#	Name	Meaning
0	NORMAL	The code point value is valid and not a C0 or C1 control character, surrogate, private use character, or noncharacter.
1	CONTROL	The code point value is valid and a C0 or C1 control character (0x00-0x1F or 0x7F-0x9F).
2	SURROGATE	The code point value is valid and a UTF‑16 surrogate (0xD800-0xDFFF).
3	HIGHCHAR	The code point value is valid and either a high surrogate (0xD800-0xDBFF) or a non-BMP character (≥0x10000).
4	PRIVATE	The code point value is valid and either a private use character (0xE000-0xF8FF, ≥0xF0000) or the high surrogate of a private use character (0xDB80-0xDBFF).
5	NONCHAR	The code point value is valid and a noncharacter (0xFDD0-0xFDEF or the last two code points of any plane).

If all of these outputs are LOW, there is no valid code point in the output.
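
The property table above can be expressed as a reference function for comparing against the chip's outputs. This sketch applies the ranges exactly as listed in the table and is limited to code points below 0x110000, since how the chip treats larger values when CHK is LOW is not stated. The function name is hypothetical.

```python
def char_properties(cp):
    """Return the set of property names for a code point below 0x110000."""
    if not 0 <= cp < 0x110000:
        raise ValueError("sketch only covers code points below 0x110000")
    props = set()
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        props.add("CONTROL")
    if 0xD800 <= cp <= 0xDFFF:
        props.add("SURROGATE")
    if 0xD800 <= cp <= 0xDBFF or cp >= 0x10000:
        props.add("HIGHCHAR")
    if 0xE000 <= cp <= 0xF8FF or cp >= 0xF0000 or 0xDB80 <= cp <= 0xDBFF:
        props.add("PRIVATE")
    # Last two code points of any plane end in FFFE or FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        props.add("NONCHAR")
    # NORMAL excludes the four categories named in its definition;
    # HIGHCHAR is not one of them, so it may appear together with NORMAL.
    if not props & {"CONTROL", "SURROGATE", "PRIVATE", "NONCHAR"}:
        props.add("NORMAL")
    return props
```

Several properties can be set at once: a high surrogate in the range 0xDB80-0xDBFF, for example, is reported as SURROGATE, HIGHCHAR, and PRIVATE.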

The test.py file covers a comprehensive set of test cases, which are listed in a separate file to avoid bloating the TT08 manual.

This project can be used with any device that needs to process Unicode text.

#	Input	Output	Bidirectional
0	/ROUT	READY; NORMAL	I/O LSB
1	ERRS, /PROPS	RETRY; CONTROL	I/O
2	CHK	INVALID; SURROGATE	I/O
3	CBE, /CLE	OVERLONG; HIGHCHAR	I/O
4	READ, /WRITE	NONUNI; PRIVATE	I/O
5	/CIO	ERROR; NONCHAR	I/O
6	/UIO	UEOF	I/O
7	/BIO	BEOF	I/O MSB