A question that I am asked on a regular basis is why particular characters in messages are not displayed ‘as expected’ by the Parola library. These messages, often typed in from the Serial monitor or embedded in strings, contain non-ASCII characters. Here’s what is happening.
What causes the problem?
At its basic level, the issue is that non-ASCII characters are being processed as ASCII characters. Parola can be used to display non-ASCII characters but the programmer needs to take steps to ensure this happens.
For any display, a character code is just a number that identifies a pattern of pixels that represent the ‘visual’ of the character to be displayed; a bit pattern that is found in the font definitions. So if the character is interpreted incorrectly, the lookup will result in an incorrect or missing bitmap shown on the display.
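For example, the small sketch below (plain C on a PC, purely to illustrate; the message “°C” is my own choice) prints the bytes that actually make up a short UTF-8 encoded string. The degree sign is one character to the reader but two bytes to the software, so a library that looks up a font bitmap for every byte will show two unexpected symbols before the ‘C’.

#include <stdio.h>

int main(void)
{
  // The degree sign U+00B0 looks like one character but is two bytes when
  // UTF-8 encoded; these are the same bytes as the literal "°C" saved in a
  // UTF-8 encoded source file.
  const char *msg = "\xC2\xB0" "C";

  // Print each byte of the string - this is what a byte-at-a-time font
  // lookup would actually see.
  for (const char *p = msg; *p != '\0'; p++)
    printf("0x%02X ", (unsigned char)*p);
  printf("\n");     // prints: 0xC2 0xB0 0x43

  return 0;
}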
ASCII vs Unicode
ASCII and Unicode are two different character encodings – standards on how to represent different characters manipulated in digital form. The main difference between the two is how they encode a character and the number of bits used for the encoding.
ASCII was developed in the early 1960s and eventually replaced a number of competing proprietary character encoding formats. So it is quite old and well established.
In ASCII, letters, digits, and symbols are encoded in 7 bits, with the printable characters represented by numbers between 32 and 126. Codes less than 32 are reserved as ‘device control’ codes, such as the newline character. As each byte can store 2⁸ (or 256) different values, one byte has more than enough space to store the basic set of English characters.
As computing spread, non-English characters needed to be accommodated, and people became inventive in how they used the still-available numbers from 128 to 255. The same numbers came to represent different characters, depending on who was writing the software. It quickly became obvious that there were not enough spare numbers to represent the complete set of characters for all languages.
Unicode was therefore developed as the single character set that could represent every character in every language. Unicode currently contains characters for most written languages and has room for even more, so it won’t need to be replaced anytime soon.
Unicode text can be stored using a choice of 32, 16, or 8-bit encodings. More bits allows more characters to be represented directly, but at the expense of larger files; fewer bits limits the characters that can be represented directly, but creates smaller files.
This change requires a shift in how characters are interpreted. In this new scheme, each character is an idealized abstract entity. Instead of a plain number, each character is identified by a code-point written as U+0639 (where U stands for ‘Unicode’ and the digits are hexadecimal).
To maintain compatibility with ASCII, Unicode was designed so that its first 128 code-points match the ASCII characters. So when an ASCII encoded file is opened with a Unicode enabled editor, the ASCII characters are interpreted correctly. This reduced the impact of the new encoding standard for those who were already using ASCII (ie, at the time, almost everyone).
Where does UTF-8 fit in?
The Unicode Transformation Formats (UTF) are the encoding standards for the Unicode character set.
UTF-8 is the 8-bit variable length encoding variant of UTF. In UTF-8, every code-point from 0 to 127 is stored in a single byte, making it backward compatible with ASCII. Code-points of 128 and above are stored using between 2 and 4 bytes. As UTF-8 is compact for Latin scripts and is ASCII-compatible, it became the de facto standard encoding for the interchange of Unicode text.
Additional UTF standards include the UTF-16 variable length encoding and UTF-32 fixed length encoding (the whole Unicode code-point fits in 32 bits). These are not backward compatible with ASCII as each character is encoded in 16 or 32 bits respectively.
Converting UTF-8 to Unicode code-points
The table below shows how UTF-8 encoding works. The x characters are replaced by the bits of the Unicode code-point, split over the requisite number of bytes depending on how large the code-point is.

Code-point range        Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F        0xxxxxxx
U+0080 to U+07FF        110xxxxx   10xxxxxx
U+0800 to U+FFFF        1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF     11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
As an example, the code-point for the Euro character U+20AC would result in the UTF-8 encoded sequence of bytes 0xE2 0x82 0xAC. Software processing this text needs to receive all three bytes from a file or input stream before it knows what code-point to display.
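As a sketch of how the bits are packed (the function name encodeUTF8 is my own and not part of any library), the C code below converts a code-point into its UTF-8 byte sequence following the table above, and prints the Euro example.

#include <stdint.h>
#include <stdio.h>

// Encode a Unicode code-point as UTF-8 into buf (room for 4 bytes needed).
// Returns the number of bytes written (1 to 4), or 0 if out of range.
uint8_t encodeUTF8(uint32_t cp, uint8_t buf[4])
{
  if (cp <= 0x7F)               // 0xxxxxxx
  {
    buf[0] = (uint8_t)cp;
    return 1;
  }
  if (cp <= 0x7FF)              // 110xxxxx 10xxxxxx
  {
    buf[0] = (uint8_t)(0xC0 | (cp >> 6));
    buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF)             // 1110xxxx 10xxxxxx 10xxxxxx
  {
    buf[0] = (uint8_t)(0xE0 | (cp >> 12));
    buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF)           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  {
    buf[0] = (uint8_t)(0xF0 | (cp >> 18));
    buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;   // beyond the valid Unicode range
}

int main(void)
{
  uint8_t buf[4];
  uint8_t n = encodeUTF8(0x20AC, buf);   // the Euro sign U+20AC

  for (uint8_t i = 0; i < n; i++)
    printf("0x%02X ", buf[i]);           // prints: 0xE2 0x82 0xAC
  printf("\n");

  return 0;
}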
Converting UTF-8 back to Unicode
All multi-byte UTF-8 sequences start with a byte whose value is greater than 127 (ie, the top bit of the byte is set), and all the following bytes that are part of the sequence similarly have their top bit set.
Decoding a UTF-8 group back to a Unicode character becomes a matter of parsing each byte in an input stream and recombining the encoded bits in the right order. This is most easily done by processing each byte using a Finite State Machine (see these earlier articles about FSMs).
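The sketch below shows the idea in plain C. The function name decodeUTF8 and the use of static variables for the state are my own illustrative choices; this is not the Parola example code or the algorithm linked below, and it makes no attempt to reject every malformed sequence (overlong encodings, for example). Each incoming byte either starts a new sequence, adds 6 bits to the one in progress, or completes it.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Feed one byte at a time; returns true and sets *cp when a complete
// code-point has been assembled. The FSM state is kept in static variables.
bool decodeUTF8(uint8_t b, uint32_t *cp)
{
  static uint32_t acc = 0;      // code-point bits accumulated so far
  static uint8_t pending = 0;   // continuation bytes still expected

  if (b < 0x80)                 // single byte - plain ASCII
  {
    *cp = b;
    pending = 0;
    return true;
  }

  if ((b & 0xC0) == 0x80)       // continuation byte 10xxxxxx
  {
    if (pending == 0) return false;    // unexpected continuation - skip it
    acc = (acc << 6) | (b & 0x3F);
    if (--pending == 0) { *cp = acc; return true; }
    return false;
  }

  // Lead bytes: the number of leading 1 bits gives the sequence length
  if ((b & 0xE0) == 0xC0)      { acc = b & 0x1F; pending = 1; }  // 110xxxxx
  else if ((b & 0xF0) == 0xE0) { acc = b & 0x0F; pending = 2; }  // 1110xxxx
  else if ((b & 0xF8) == 0xF0) { acc = b & 0x07; pending = 3; }  // 11110xxx
  else pending = 0;            // invalid lead byte - resynchronise

  return false;
}

int main(void)
{
  const uint8_t stream[] = { 'E', 0xE2, 0x82, 0xAC };   // 'E' then the Euro sign
  uint32_t cp;

  for (size_t i = 0; i < sizeof(stream); i++)
    if (decodeUTF8(stream[i], &cp))
      printf("U+%04lX\n", (unsigned long)cp);           // prints: U+0045 then U+20AC

  return 0;
}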
In the Parola library, the example Parola_UTF-8_Display includes a decoding function that works for 2-byte UTF-8 encodings, mapping decoded characters correctly to the locally included font definition. It is not designed to work with UTF-8 encodings longer than 2 bytes.
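As an illustration of that 2-byte approach (this is not the code from the Parola example, just a minimal sketch of the same idea, with a function name of my own choosing), a decoder restricted to lead bytes 0xC2 and 0xC3 yields code-points between 0x80 and 0xFF, which can be used directly as indices into an 8-bit, extended ASCII style font table.

#include <stdint.h>

// Convert a 2-byte UTF-8 sequence (lead byte plus continuation byte) into a
// code-point in the range 0x80..0xFF, usable as an 8-bit font table index.
// Example: utf8TwoByteToFontIndex(0xC2, 0xB0) returns 0xB0, the degree sign.
uint8_t utf8TwoByteToFontIndex(uint8_t lead, uint8_t cont)
{
  if ((lead == 0xC2 || lead == 0xC3) && (cont & 0xC0) == 0x80)
    return (uint8_t)(((lead & 0x1F) << 6) | (cont & 0x3F));   // U+0080..U+00FF
  return '?';   // anything else is outside what a 2-byte-only decoder handles
}

Of course, this mapping only makes sense if the font actually defines glyphs at those positions, which is why the Parola example uses its own locally included font definition.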
A more complete discussion of an efficient and fast decoding algorithm can be found here. Full decoding requires substantial amounts of flash memory for the FSM lookup tables and, in Parola, each character also needs a bitmap defined for the display output. Neither of these will fit into memory constrained MCUs like the Arduino Uno, but they would be feasible for more advanced processors with megabytes of RAM.
The single most important thing to remember from the information presented above is that there is no string or text without an accompanying encoding standard (implicit or otherwise). To a computer user this is transparent, but as a programmer it needs to be kept top of mind when processing text.