Parola A to Z – Handling non ASCII characters (UTF-8)

A question that I am asked on a regular basis is why particular characters in messages are not displayed ‘as-expected’ by the Parola library. These messages, often typed in from the Serial monitor or embedded as strings in the code, contain non-ASCII characters. Here’s what is happening.

What causes the problem?

At its basic level, the issue is that non-ASCII characters are being processed as ASCII characters. Parola can be used to display non-ASCII characters but the programmer needs to take steps to ensure this happens.

For any display, a character code is just a number that identifies a pattern of pixels that represent the ‘visual’ of the character to be displayed; a bit pattern that is found in the font definitions. So if the character is interpreted incorrectly, the lookup will result in an incorrect or missing bitmap shown on the display.
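As a small illustration of the mismatch, the sketch below spells out the bytes that a UTF-8 editor would store for the word "café" and compares the number of bytes with the number of characters. It is plain C++ and the names (kMessage, byteCount, charCount) are made up for this example; they are not part of Parola.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "café" as a UTF-8 editor would store it: the é (U+00E9) becomes the
// two bytes 0xC3 0xA9. The escapes make those bytes explicit.
const char kMessage[] = "caf\xC3\xA9";

// Number of bytes in the string. Code that assumes one byte per
// character would look up this many font entries.
std::size_t byteCount(const char *s)
{
  assert(s != nullptr);
  return std::strlen(s);
}

// Number of characters in the string. UTF-8 continuation bytes (bit
// pattern 10xxxxxx) belong to the preceding byte, so they are skipped.
std::size_t charCount(const char *s)
{
  assert(s != nullptr);
  std::size_t n = 0;
  for (; *s != '\0'; s++)
  {
    if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
      n++;
  }
  return n;
}
```

Here byteCount(kMessage) is 5 while charCount(kMessage) is 4. If each byte is looked up in the font separately, the é is shown as two unrelated glyphs (or as nothing, if the font has no bitmap for those codes).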

ASCII vs Unicode

ASCII and Unicode are two different character encodings – standards on how to represent different characters manipulated in digital form. The main difference between the two is how they encode a character and the number of bits used for the encoding.

ASCII was first published as a standard in 1963 and eventually replaced a number of competing proprietary character encoding formats. So it is quite old and established.

ASCII is a 7-bit code that is normally stored one character per 8-bit byte. All letters, digits, and symbols are represented as a number between 32 and 126. Codes less than 32, and code 127, are reserved as ‘device control’ codes such as the newline character. As each byte can store 2⁸ (or 256) different values, one byte has more than enough space to store the basic set of English characters.

As computing spread, non-English characters needed to be accommodated and people became really inventive on how to use the byte values from 128 to 255 that were still available. The same numbers represented different characters, depending on who was programming the software. It quickly became obvious that there were insufficient spare numbers to represent the complete set of characters for all languages.

Unicode was therefore developed as the single character set that could represent every character in every language. Unicode currently contains characters for most written languages and has room for even more, so it won’t need to be replaced anytime soon.

Unicode text can be stored using a choice of 32, 16, and 8-bit encodings. Wider code units make each character simpler to handle at the expense of larger files; narrower code units give smaller files for mostly ASCII text, but many characters then need more than one unit.

This change requires a shift in how to interpret characters. In this new encoding, each character is an idealized abstract entity. Instead of being tied to a particular stored bit pattern, each character in this system is assigned a code-point, written as U+0639, where U+ stands for ‘Unicode’ and the digits are hexadecimal.

To maintain compatibility with ASCII, Unicode was designed so that its first 128 code-points match the ASCII characters. So when an ASCII encoded file is opened with a Unicode enabled editor, the ASCII characters are interpreted correctly. This reduced the impact of the new encoding standard for those who were already using ASCII (ie, at the time, almost everyone).

Where does UTF-8 fit in?

The Unicode Transformation Formats (UTF) are the encoding standards for the Unicode character set.

UTF-8 is the 8 bit variable length encoding variant of UTF. In UTF-8, every code-point from 0 to 127 is stored in a single byte, making it backward compatible with ASCII. Code-points from 128 upwards are stored using between 2 and 4 bytes. As UTF-8 is compact for Latin scripts and is ASCII-compatible, it became the de facto standard encoding for interchange of Unicode text.

Additional UTF standards include the UTF-16 variable length encoding and the UTF-32 fixed length encoding (the whole Unicode code-point fits in 32 bits). These are not backward compatible with ASCII, as even an ASCII character occupies 16 or 32 bits respectively.

Converting Unicode code-points to UTF-8

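The table below shows how UTF-8 encoding works. The x characters are replaced by the bits of the code-point.

Code-point range      Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F      0xxxxxxx
U+0080 to U+07FF      110xxxxx   10xxxxxx
U+0800 to U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Source: https://en.wikipedia.org/wiki/UTF-8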

The table below shows examples of how Unicode characters are encoded. The bits of the Unicode code-point are split over the requisite number of bytes depending on how large it is.

Character   Code-point   UTF-8 bytes (hex)
$           U+0024       24
£           U+00A3       C2 A3
€           U+20AC       E2 82 AC
𐍈           U+10348      F0 90 8D 88

Source: https://en.wikipedia.org/wiki/UTF-8

As an example, the code-point for the Euro character U+20AC would result in the UTF-8 encoded sequence of bytes 0xE2 0x82 0xAC. Software processing this text needs to receive all three bytes from a file or input stream before it knows what code-point to display.
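The encoding rules can be sketched in a few lines of C++. The function name encodeUtf8 and its signature are assumptions made for this illustration, not something provided by Parola, and the code favours readability over speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encode one Unicode code-point as UTF-8 into out[] (room for 4 bytes).
// Returns the number of bytes written, or 0 if cp cannot be encoded.
std::size_t encodeUtf8(std::uint32_t cp, std::uint8_t out[4])
{
  assert(out != nullptr);

  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;                            // UTF-16 surrogates are not characters

  if (cp <= 0x7F)                        // 0xxxxxxx
  {
    out[0] = static_cast<std::uint8_t>(cp);
    return 1;
  }
  if (cp <= 0x7FF)                       // 110xxxxx 10xxxxxx
  {
    out[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
    out[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF)                      // 1110xxxx 10xxxxxx 10xxxxxx
  {
    out[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
    out[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF)                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  {
    out[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
    out[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
    out[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                              // beyond the Unicode range
}
```

Calling encodeUtf8(0x20AC, buf) returns 3 and leaves 0xE2, 0x82, 0xAC in buf, matching the Euro example above.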

Converting UTF-8 back to Unicode

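Every multi-byte UTF-8 sequence starts with a byte of 128 or greater (ie, top bit set in the byte) and all the following bytes that are part of the sequence similarly have their top bit set. The leading byte starts with the bits 11 and each following byte starts with the bits 10, so the two can be told apart.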
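A minimal sketch of that classification, assuming nothing more than the bit patterns of the UTF-8 format, is shown below. The function name utf8SequenceLength is hypothetical and is not part of the Parola library.

```cpp
#include <cassert>
#include <cstdint>

// Classify one byte of a UTF-8 stream by its top bits.
// Returns the total length (1 to 4) of the sequence that this byte
// starts, or 0 if the byte cannot start a sequence (it is either a
// continuation byte or a value that UTF-8 never uses as a lead byte).
int utf8SequenceLength(std::uint8_t b)
{
  if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx: plain ASCII
  if ((b & 0xC0) == 0x80) return 0;   // 10xxxxxx: continuation byte
  if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx: starts a 2 byte sequence
  if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx: starts a 3 byte sequence
  if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx: starts a 4 byte sequence
  return 0;                           // 11111xxx: not a valid lead byte
}
```

For the Euro sequence 0xE2 0x82 0xAC, the first byte reports a length of 3 and the other two report 0, marking them as continuation bytes. The check is deliberately loose: it does not reject lead bytes such as 0xC0 that a strict decoder would refuse.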

Decoding a UTF-8 group back to a Unicode code-point becomes a matter of parsing each byte in an input stream and recombining the encoded bits back in the right order. This is most easily done by processing each character using a Finite State Machine (see these earlier articles about FSMs).
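One possible shape for such a state machine is sketched below. The state is simply the number of continuation bytes still expected. The names Utf8Decoder and utf8Feed are invented for this example, and the sketch is simplified: it drops malformed bytes silently and does not reject overlong or out-of-range sequences as a strict decoder would.

```cpp
#include <cassert>
#include <cstdint>

// Decoder state: how many continuation bytes are still expected and
// the code-point bits collected so far.
struct Utf8Decoder
{
  std::uint8_t  remaining = 0;
  std::uint32_t codePoint = 0;
};

// Feed one byte to the decoder. Returns true when a complete code-point
// has been placed in 'out'; returns false while more bytes are needed
// or when a malformed byte has been dropped.
bool utf8Feed(Utf8Decoder &d, std::uint8_t b, std::uint32_t &out)
{
  if (d.remaining > 0)                  // state: inside a multi-byte sequence
  {
    if ((b & 0xC0) == 0x80)             // the expected continuation byte
    {
      d.codePoint = (d.codePoint << 6) | (b & 0x3F);
      if (--d.remaining == 0)
      {
        out = d.codePoint;
        return true;
      }
      return false;
    }
    d.remaining = 0;                    // malformed: abandon, then treat b as a new start
  }

  // state: idle, waiting for the start of a character
  if (b < 0x80)                         // ASCII needs no decoding
  {
    out = b;
    return true;
  }
  if ((b & 0xE0) == 0xC0)      { d.codePoint = b & 0x1F; d.remaining = 1; }
  else if ((b & 0xF0) == 0xE0) { d.codePoint = b & 0x0F; d.remaining = 2; }
  else if ((b & 0xF8) == 0xF0) { d.codePoint = b & 0x07; d.remaining = 3; }
  // anything else is a stray continuation or invalid byte and is ignored
  return false;
}
```

Feeding the bytes 0xE2, 0x82, 0xAC one at a time returns false, false, and then true with out set to 0x20AC. Because the decoder keeps its state between calls, it can be driven directly from a serial input stream as each byte arrives.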

In the Parola Library, the example Parola_UTF-8_Display includes a decoding function that works for 2 byte UTF-8 encodings, mapping decoded characters correctly to the locally included font definition. It is not designed to work with UTF-8 encoding longer than 2 bytes.
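To show why the 2 byte case is enough for most accented Latin characters, here is a sketch written for this article; it is not the code from the Parola example. It assumes the font is laid out in ISO 8859-1 (Latin-1) order, in which case code-points U+0080 to U+00FF map directly onto character codes 128 to 255. The function name utf8ToLatin1 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert UTF-8 text to one byte per character for a font assumed to be
// in ISO 8859-1 order. Only code-points U+0000 to U+00FF can be
// represented; anything else is replaced by '?'.
// Returns the number of characters written, excluding the final '\0'.
std::size_t utf8ToLatin1(const char *in, char *out, std::size_t outSize)
{
  assert(in != nullptr && out != nullptr && outSize > 0);

  std::size_t n = 0;
  while (*in != '\0' && n + 1 < outSize)
  {
    std::uint8_t b = static_cast<std::uint8_t>(*in++);

    if (b < 0x80)
    {
      out[n++] = static_cast<char>(b);  // ASCII passes straight through
    }
    else if ((b == 0xC2 || b == 0xC3) &&
             (static_cast<std::uint8_t>(*in) & 0xC0) == 0x80)
    {
      // 110000xx 10yyyyyy  becomes  xxyyyyyy  (U+0080 to U+00FF)
      std::uint8_t next = static_cast<std::uint8_t>(*in++);
      out[n++] = static_cast<char>(((b & 0x03) << 6) | (next & 0x3F));
    }
    else
    {
      // Longer or malformed sequence: skip its continuation bytes
      while ((static_cast<std::uint8_t>(*in) & 0xC0) == 0x80)
        in++;
      out[n++] = '?';
    }
  }
  out[n] = '\0';
  return n;
}
```

With this mapping the two bytes 195 167 (0xC3 0xA7) for ç collapse into the single character code 231, and é (0xC3 0xA9) becomes 233. The converted buffer, rather than the original string, is what would then be handed to the display.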

A more complete discussion of an efficient and fast decoding algorithm can be found here. Full decoding requires substantial amounts of flash memory for the FSM lookup tables. In Parola, each character also needs a bitmap defined for the display output. Neither of these will fit into memory constrained MCUs like the Arduino Uno, but would be feasible for more advanced processors with megabytes of memory.

The single most important thing to remember from the information presented above is that there is no string or text without an accompanying encoding standard (implicit or otherwise). To a computer user this is transparent, but as a programmer it needs to be kept top of mind when processing text.

6 replies on “Parola A to Z – Handling non ASCII characters (UTF-8)”

Marco, how are you? I am getting started with Arduino and your MD_Parola library, and I would like to ask: is there a way for me to rotate the text? I am not having problems, my project is working fine… but I would like to offer the option of the text being written in any direction… And my second question is about accented characters… I read your blog here, but I did not understand it very well… would I have to create a table, like an array of bytes with the representation of these special characters, is that it? Cheers!

>can I rotate the text?
Yes. You need to look at the example for double height text – the top row is standard font rotated 90 degrees. For vertical text you need to define a different font. There are blog articles for both of these situations.

>I would have to create a table, like an array of bytes with the representation of these special characters
You need to create a font file. The current font already contains the Latin characters that you will probably need, but you will need to translate the UTF-8 characters into the right character code, which is what is described in this article. There is also a blog article on how to define a new font.

thanks for your answer..i will translate to English now… lol.. sorry for that…

im using your example “parola utf8 display” and i put the methods for conversion;;; but when i try a char like “ç” or “é” they appear like :

“ç”
195 167 0 0 0 0 0 0 0 0 0
appears like “Ã§” (a tilde and section sign)

could you help me? If you have an email, i could send my project to you..

thanks!

Rafael

Hi! im testing a lot, and what im doing is:

char msgDisplay[] = "éç";
showMsg

void showMsg(String msg) {
  for (uint8_t i=0; i<ARRAY_SIZE(msg); i++)
  {
    utf8Ascii(msg[i]);
  }

  strcpy(msgDisplay, msg.c_str());
  P.displayText(msgDisplay, scrollAlign, scrollSpeed, scrollPause, scrollEffect, scrollEffect);
}

but the error is in the conversion… ;(

regards,

i can’t get “Æ Ø Å” to show up no matter WHAT i do and im about to break something cause i’ve used many hours on this…. i have included the Parola_fonts_data.h and i’ve checked that it contains æ Æ ø Ø å Å and other characters i need! and in the setup() i have P.setFont(ExtASCII);

what else do i need?

As you don’t say what you have done, it is hard to suggest anything other than to make sure that you know what font index the characters are mapping to. You need to print out the character values (decimal numbers) AFTER your UTF-8 conversion. If these don’t match the characters you expect, either the font file is wrong or the conversion is wrong.