CS 100 Module 01


Unicode

NOTE: If your internet access is restricted and you do not have access to YouTube, we have provided alternate video links.

TRANSCRIPT

In this video we will learn about Unicode, which allows us to represent characters from many different languages around the world. (I'm sorry if you thought this was about unicorns.)

The ASCII table was great and helped standardize communication across digital devices. However, it was very "American" (the A in ASCII), and it wasn't very long before Europeans wanted to communicate and use fancy accents [à, á, â, ã, ä, å] and other symbols.

Fortunately, the ASCII table had only one hundred and twenty-eight [128] entries (which fit in seven bits), and since each byte can store eight bits, there was room for more entries.
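As a quick check on that arithmetic, here is a tiny sketch (in Python, which is our assumption; the transcript does not use any particular language):

    print(2 ** 7)    # 128 -> the number of entries in the original 7-bit ASCII table
    print(2 ** 8)    # 256 -> the number of values a single 8-bit byte can hold

So a full byte leaves another one hundred and twenty-eight [128] slots free beyond plain ASCII.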

In the early nineteen eighties [1980s], extended ASCII was developed and additional characters were added including accented characters. There were actually several different variants of extended ASCII, and to be honest, it was a bit of a mess. In addition, it wasn't even close to being able to handle Asian characters and other more complicated characters. So, in the late nineteen eighties [1980s] Unicode was created.

The Unicode Standard is developed by a non-profit organization known as the "Unicode Consortium", which is mostly made up of multinational tech companies. They coordinate with the International Organization for Standardization [ISO], and their ambitious goal is to

"enable people around the world to use computers in any language".

Unicode supports over one hundred and fifty [150] different scripts (writing systems) and over one hundred thousand [100,000] unique characters. It also supports ligatures, where characters are combined into a single glyph. Unicode continues to expand and evolve, and new characters are added every year.

How does Unicode work? Well, it's complicated, but the simplest answer is that it uses more than one byte per character. The most common way to store Unicode in bytes is the UTF-8 standard, which stands for "Unicode Transformation Format, 8-bit". It's what over ninety percent [90%] of websites use. It mixes plain ASCII with the rest of Unicode. If a character is in the ASCII range, then it uses just one byte to store that character. If a character is outside the ASCII range, then it uses two, three or four bytes... depending on the character. You don't need to understand exactly how it works, just that it's a mixture of ASCII and the rest of Unicode.
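To see this mixture in action, here is a minimal sketch (in Python, which is an assumption on our part) that prints how many UTF-8 bytes a few characters take:

    # How many bytes does UTF-8 use for each character?
    for ch in ["A", "é", "€", "💩"]:
        encoded = ch.encode("utf-8")             # convert the character to its UTF-8 bytes
        print(ch, len(encoded), encoded.hex(" "))

    # Typical output:
    # A 1 41             <- plain ASCII: one byte
    # é 2 c3 a9          <- accented Latin letter: two bytes
    # € 3 e2 82 ac       <- euro sign: three bytes
    # 💩 4 f0 9f 92 a9    <- emoji: four bytes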

You have probably seen what happens if a website or an application doesn't know how to interpret Unicode. This can happen when you are cutting and pasting between websites or applications. Even something as simple as curly quotation marks can be a problem, because ASCII only has straight quotes. You often end up with garbage where the Unicode characters are supposed to be. Instead of “hello”, you can end up with something like â€œhelloâ€▯.
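Here is a minimal sketch of how that garbage comes about (again in Python, as an assumption): the curly quotes are sent as UTF-8 bytes, but the receiving program mistakenly decodes those bytes as Windows-1252.

    text = "\u201chello\u201d"               # “hello” with curly quotation marks
    raw_bytes = text.encode("utf-8")          # the bytes that are actually stored or sent
    garbled = raw_bytes.decode("windows-1252", errors="replace")
    print(garbled)                            # prints something like: â€œhelloâ€�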

For a lot of people, the most exciting part of Unicode is that it can represent emojis. If you visit the Unicode website [https://unicode.org/emoji/charts/full-emoji-list.html] you can see all the possible emojis and what their corresponding hex codes are.

For example, one of the most famous emojis is the "pile of poo" [💩], and its Unicode code point in hex is "one F four A nine" [U+1F4A9]. In UTF-8, it is actually spread over four bytes. So whenever you text someone that they are a "pile of poo", you are really sending those bytes.
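Here is a small sketch (Python again, as an assumption) connecting the code point from the chart to the four UTF-8 bytes:

    poo = "\U0001F4A9"                   # the "pile of poo" character
    print(hex(ord(poo)))                 # 0x1f4a9     <- the hex code listed on the chart
    print(poo.encode("utf-8").hex(" "))  # f0 9f 92 a9 <- the four bytes that actually get sent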

I doubt that the early pioneers of computers could ever have imagined that one day there would be a way to encode "a smelly pile of poo" as digital information.