Clarifying the basic concepts

I’m sure everyone knows this to some extent, but somehow that knowledge gets lost in debates over text, so let’s first reiterate a bit: computers cannot store “words”, “numbers”, “photos”, or anything else. The only thing a computer can store and work with is bits. A bit can only have two values: yes or no, true or false, 1 or 0, or whatever else you want to call them. Since computers run on electricity, an actual bit is a voltage, a current pulse, or the electrical state of a flip-flop circuit. For humans, bits are usually written as 1s and 0s, so I will stick to that convention throughout this article.


To use bits to represent anything, we need rules. We need to convert a string of bits into something like letters, numbers, and pictures using an encoding scheme, or encoding for short. Like this:

01100010 01101001 01110100 01110011
b        i        t        s

In this encoding, 01100010 stands for the letter “b”, 01101001 for the letter “i”, 01110100 for the letter “t” and 01110011 for the letter “s”. A certain sequence of bits stands for a letter, and a letter stands for a certain sequence of bits. If you can memorize the bit strings for 26 letters, you can read bits like a book.

The encoding scheme above is called ASCII. A sequence of 1s and 0s is divided into parts of 8 bits (1 byte) each, and ASCII specifies a table for translating those bytes into human-readable letters. Here is a small portion of that table:

bits      character
01000001  A
01000010  B
01000011  C
01000100  D
01000101  E
01000110  F

There are a total of 95 human-readable characters specified in the ASCII table, including the letters A through Z in both lowercase and uppercase, the numbers 0 through 9, the space, some punctuation, and characters such as the dollar sign, the exclamation point and a few others. It also includes 33 values for things like carriage return, tab, backspace, etc. These are not printable as such, but some of them are still visible in some form and directly useful to people. Other values are only useful to computers, like codes that mark the beginning and end of a text. A total of 128 characters are defined in the ASCII encoding, which is a nice round number (to those familiar with computers), because it uses up all possible combinations of 7 bits (0000000 through 1111111).
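The counts above can be checked with a few lines of Python. This is only a sketch: it relies on Python's own notion of "printable" (`str.isprintable`), which counts the space among the 95 readable characters and treats everything else in the 7-bit range as a control value.

```python
# Split the 128 ASCII values into printable characters and control values,
# using Python's str.isprintable() as the (assumed) definition of "printable".
printable = [chr(i) for i in range(128) if chr(i).isprintable()]
control = [i for i in range(128) if not chr(i).isprintable()]

print(len(printable), len(control))  # 95 33
print(2 ** 7)                        # 128 combinations of 7 bits
```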

And now we have a way to represent text using just 1s and 0s:

01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100
“Hello World”

Important Terms

To encode something in ASCII, follow the table from right to left, replacing letters with bits. To decode a string of bits into readable characters, follow the table from left to right, replacing bits with letters.
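As a rough illustration of those two directions, here is a minimal Python sketch. The function names `encode_ascii_bits` and `decode_ascii_bits` are made up for this example; the table lookup itself is delegated to Python's built-in "ascii" codec.

```python
def encode_ascii_bits(text: str) -> str:
    """Letters to bits: replace each character with its 8-bit ASCII pattern."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def decode_ascii_bits(bits: str) -> str:
    """Bits to letters: replace each 8-bit group with its ASCII character."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


bits = encode_ascii_bits("Hello World")
print(bits)
print(decode_ascii_bits(bits))
```

Running it prints the same bit string as above, followed by the text it decodes back to.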

To encode means to use something to represent something else. An encoding is the set of rules for doing that conversion.


Some other terms that need clarification in this context:

character set, charset

Set of characters that can be encoded. “ASCII encoding consists of a character set of 128 characters.” Essentially synonymous with “encoding”.

code page

A “page” of codes to associate characters with a corresponding sequence of bits. Can also be understood as a “table”. Essentially synonymous with “encoding”.

string

A string is a number of elements strung together. A bit string is a series of bits, like 01010011. A character string is a series of characters, like this. Synonymous with “sequence”.

Binary, Octal, Decimal, Hex

There are many ways to write a number. 10011111 in binary is 237 in octal, 159 in decimal and 9F in hexadecimal. They all represent the same value, but hexadecimal numbers are shorter and easier to read than binary numbers. However, I will use binary throughout this article to make the point easier to follow and to remove a layer of abstraction. Don’t worry if you see character codes written in another system somewhere else; they are all the same thing.
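For readers who want to convince themselves, the same value can be written and parsed in all four systems in Python (a small sketch, nothing more):

```python
value = 0b10011111  # the binary number from the paragraph above

# One value, four notations.
print(bin(value), oct(value), value, hex(value))
# 0b10011111 0o237 159 0x9f

# Parsing each notation back gives the same integer.
parsed = [int("10011111", 2), int("237", 8), int("159", 10), int("9F", 16)]
print(parsed)  # [159, 159, 159, 159]
```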

Excusez-Moi?

Now that we have the above points in mind, let’s be honest: 95 characters is too few when we talk about languages. It covers basic English, but what if we want to write a risqué letter in French? A Straßenübergangsänderungsgesetz (road law) in German? An invitation to a smörgåsbord (buffet) in Swedish? Well, you can’t. It’s not possible in ASCII. ASCII has no rule for representing letters like é, ß, ü, ä, ö or å, so we can’t use it.
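A quick way to see this limitation is to ask Python's "ascii" codec to encode one of those words. The variable names below are only for illustration:

```python
word = "smörgåsbord"
failed = None

try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    # The slice of the input that the ASCII table has no entry for.
    failed = err.object[err.start:err.end]

print(ascii_ok, failed)  # False ö
```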

“But look,” said the Europeans, “in a common computer where 1 byte equals 8 bits, the ASCII encoding is wasting a whole bit by always setting it to 0! We can use this bit to stuff 128 more values into that table!” And so they did. But even so, there are more than 128 ways to decorate a vowel. Not all the letter variants used in the languages of Europe fit into the same table with at most 256 values. And so the world was engulfed in a sea of encodings, standards, de facto standards and even… semi-standards for different character sets. Someone needed to write a document about Swedish in Czech, couldn’t find an encoding that covered both languages, and so had to make one up. And that happened a thousand times.
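The consequence of that sea of encodings is that a single byte with the eighth bit set means different things depending on the table you read it with. A small sketch, using three 8-bit encodings that ship with Python (the choice of Latin-1, Windows-1251 and ISO 8859-7 is mine, not the article's):

```python
raw = bytes([0b11101001])  # one byte with the eighth bit set (0xE9)

# The same bits, read through three different 8-bit tables.
readings = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso-8859-7")}

for name, char in readings.items():
    print(name, char)
```

The byte comes out as a Latin, a Cyrillic and a Greek letter respectively.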

And don’t forget Russian, Hindi, Arabic, Hebrew, Korean and thousands of other languages spoken on earth. Not to mention languages that are no longer in use. Once you have solved the problem of mixing all of those languages in the same text, challenge yourself with Chinese. Or Japanese. Both contain tens of thousands of characters. You have 256 possible values in a byte of 8 bits. Go!

Multi-Byte Encodings

To create a table that maps bits to characters for a language with more than 256 characters, one byte is simply not enough. With 2 bytes (16 bits) we can encode up to 65,536 distinct values. BIG-5 is an encoding that does just that. Instead of splitting a string of bits into blocks of 8, it splits it into blocks of 16 and has a giant table (I mean, HUGE) that dictates which character goes with which bit string. BIG-5 in its basic form already handles most of the characters of Traditional Chinese. GB18030 is another encoding that takes a similar approach, but it includes both Simplified and Traditional Chinese. And before you ask, yes, there are other encodings for Simplified Chinese only. We can’t just have one encoding, can we?

Here is a small part of the GB18030 table:

bits                 character
10000001 01000000    丂
10000001 01000001    丄
10000001 01000010    丅
10000001 01000011    丆
10000001 01000100    丏

GB18030 handles a large number of characters (including most Latin characters), but in the end it is just a specialized encoding format among others.
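Python ships codecs for both encodings, so the two-byte lookups can be tried directly. This is a sketch: the sample character 中 is my own choice, and the first two bytes are taken from the first row of the table above.

```python
# First row of the table above: 10000001 01000000 (0x81 0x40).
first_row = bytes([0b10000001, 0b01000000])
char = first_row.decode("gb18030")
print(char, f"U+{ord(char):04X}")

# One Chinese character, looked up in two different double-byte tables.
sizes = {}
for codec in ("big5", "gb18030"):
    encoded = "中".encode(codec)
    sizes[codec] = len(encoded)
    print(codec, len(encoded), "bytes:", " ".join(f"{b:08b}" for b in encoded))
```

Both codecs use two bytes for this character, but the bit patterns differ because each encoding has its own table.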

Unicode Confusion

Finally, someone could not bear it any longer and set out to create one standard that would unify all the others. This standard is called Unicode. It basically defines one enormous table with 1,114,112 code points that can be used for all kinds of letters and symbols. That is more than enough to encode all European, Middle Eastern, Far Eastern, Southern, Northern, Western, prehistoric and even future languages that humans have not yet thought of. Using Unicode, you can compose text in nearly any language using any character you can type. This was either impossible or very difficult before Unicode was born. There is even an unofficial section for Klingon (Star Trek). You see, Unicode is so big that it also sets aside areas for private use.

So how many bits does Unicode use to encode all those characters? 0. Because Unicode is not an encoding.

Confused? Many people seem to be. First, Unicode defines a table of code points for characters. That is just a fancy way of saying “65 stands for A, 66 for B and 9,731 for ☃” (really). How these code points are encoded into bits is another story. To hold 1,114,112 different values, 2 bytes are not enough. 3 bytes would be enough, but 3 bytes are awkward to work with, so 4 bytes is the comfortable minimum. But unless you use Chinese or another language with a large number of characters that need a lot of bits to encode, you will never use most of those 4 bytes. If the letter “A” were always encoded as 00000000 00000000 00000000 01000001, “B” as 00000000 00000000 00000000 01000010, and so on, any text would take up four times as much space as it needs.
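The split between "code point" and "bits" is easy to see in Python, where `ord` returns the code point and a codec turns it into bytes. The helper name `fixed_width_bits` is invented for this sketch; it uses the big-endian UTF-32 codec to show the 4-bytes-per-character layout described above.

```python
def fixed_width_bits(ch: str) -> str:
    """Show a character as four bytes (big-endian UTF-32), written as bits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-32-be"))


for ch in ("A", "B", "☃"):
    # ord() gives the code point; the encoding step is separate.
    print(ch, ord(ch), fixed_width_bits(ch))
```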

To deal with this problem, there are several ways to encode code points into bits. UTF-32 is an encoding that encodes every code point using 32 bits. That is, 4 bytes per character. It is very simple, but often wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be encoded with 1 byte (because its code point is a very small number), UTF-8 will encode it with 1 byte. If the character takes 2 bytes, it will use 2 bytes, and so on. When decoding, the first byte of each character determines how many bytes make up that character, specifically:

First byte starts with bit pattern “0” (0x00-0x7f) => the character is 1 byte long.
First byte starts with bit pattern “110” (0xc0-0xdf) => the character is 2 bytes long.
First byte starts with bit pattern “1110” (0xe0-0xef) => the character is 3 bytes long.
First byte starts with bit pattern “11110” (0xf0-0xf7) => the character is 4 bytes long.
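The rules in the list above can be sketched as a small function. The name `utf8_sequence_length` is hypothetical, and a real decoder does more validation than this (for example, checking the continuation bytes that follow).

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if lead >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"{lead:08b} cannot start a UTF-8 character")


for ch in ("A", "é", "☃", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, f"{encoded[0]:08b}", utf8_sequence_length(encoded[0]), len(encoded))
```

For each sample character, the length predicted from the first byte matches the number of bytes Python's own UTF-8 encoder produced.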

Using the most significant bits of the first byte to signal the length can save space, but it also wastes space if those signal bits are needed too often. UTF-16 sits in the middle: it uses at least 2 bytes per character and grows to 4 bytes if needed.
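To get a feel for that trade-off, the sketch below measures the same text in all three encodings. The sample strings and the helper name `encoded_sizes` are my own; the little-endian codec variants are used so that Python does not prepend a byte order mark and skew the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the text in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ("Hello World", "你好世界", "😀"):
    print(sample, encoded_sizes(sample))
# Hello World {'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}
# 你好世界 {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
# 😀 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

UTF-8 wins for English text, UTF-16 is more compact for the Chinese sample, and UTF-32 is never smaller than the other two.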

And that’s all there is to Unicode. Unicode is a large table that associates characters with numbers, and the different UTF encodings specify how these numbers are encoded as bits. Taken as a whole, Unicode is just one more encoding scheme, and there’s nothing special about it except that it tries to handle everything while still being efficient. And that’s A Good Thing.™

Code Points

A character is referred to by its “code point”. Code points are written in hexadecimal (to keep them short), prefixed with “U+” (which has no meaning other than to signal that this is a Unicode code point). For example, the character Ḁ has the code point U+1E00. In other words, it is character number 7680 of the Unicode table. Its official name is “LATIN CAPITAL LETTER A WITH RING BELOW”.
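Python's standard `unicodedata` module can look these details up. The helper `describe` below is a made-up name for this sketch; the character is written as the escape `"\u1e00"` so the example does not depend on how this page itself is encoded.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX OFFICIAL NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


ring_a = "\u1e00"
print(ord(ring_a))       # 7680
print(describe(ring_a))  # U+1E00 LATIN CAPITAL LETTER A WITH RING BELOW
```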

Too long; didn’t read

A little summary of the above: any character can be encoded as many different bit strings, and any bit string can represent different characters, depending on which encoding is used to read or write it. The reason is simply that different encodings use a different number of bits per character, and use different values to represent different characters.
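Both halves of that summary fit into a few lines of Python. This is only an illustration; the character é and the three codecs are arbitrary picks.

```python
char = "\u00e9"  # é

# One character, several bit strings.
as_bits = {codec: " ".join(f"{b:08b}" for b in char.encode(codec))
           for codec in ("latin-1", "utf-8", "utf-16-be")}
for codec, bits in as_bits.items():
    print(codec, bits)

# One bit string, several readings.
data = char.encode("utf-8")
right = data.decode("utf-8")
wrong = data.decode("latin-1")
print(right, wrong)  # é Ã©
```

Reading UTF-8 bytes with the Latin-1 table does not fail; it silently produces two different characters, which is how garbled text usually comes about.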


(End of part 1)

The article is translated from What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.