FAQ: What is a code page?


What is a code page?


For a computer to process text, characters must be represented by numeric values. An encoding scheme, that is, a set of characters with a numeric index assigned to each character in a fixed order, is used to map the characters that are input from the keyboard. This encoding scheme is called a code page, and the numeric index associated with each character is called a code point value. Think of a code page as an organized table containing a collection of characters, called the character set, which computers use to process text; it allows an operating system to identify each character distinctly through its corresponding code point value. For example, when a person types the euro currency symbol '€' in a language environment that uses the Windows 1252 code page, which covers English and most Western European languages on Windows, a code point value of '0x80' is registered; when the character '€' is saved, the data actually written to disk is the code point value '0x80'.
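
The mapping from a character to its code point value can be checked with a short Python sketch. It assumes that Python's built-in "cp1252" codec mirrors the Windows 1252 code page; the variable names are illustrative only.

```python
# Assumption: Python's "cp1252" codec mirrors the Windows 1252 code page.
euro = "€"

# Encoding turns the character into the byte that would be written to disk.
encoded = euro.encode("cp1252")
code_point_value = encoded[0]

print(encoded)                 # the raw single byte
print(hex(code_point_value))   # the code point value in hexadecimal

# Reading the byte back through the same code page restores the character.
decoded = encoded.decode("cp1252")
print(decoded)
```

In this codec the euro sign occupies position 0x80, so a file saved under code page 1252 stores that single byte for the character.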

Most language groups, and the operating systems that support them, have their own character sets and code pages to accommodate the letters those languages use. Each operating system has an encoding scheme that maps code point values to specific characters. In other words, code pages are defined to represent and support a language or a set of languages that share a common writing system. For example, the Danish, Dutch, English and German languages can be represented by the American National Standards Institute's (ANSI) 1252 code page in Windows, and the Chinese language can be represented by an Extended UNIX Code (EUC) code page in UNIX.

Various writing systems use characters that others don't; they have their own character sets, so different code pages exist to support them. For example, in the ANSI 1252 code page, used primarily for English and most Western European languages, the code point value 202 in decimal (CA in hexadecimal) represents the character 'Ê', but in the ANSI 1253 code page (used for Greek) the same code point value represents the Greek capital letter kappa 'Κ'. The following is a list of Windows code pages:
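
To see how one code point value yields different characters under different code pages, the following Python sketch decodes the same byte twice. It assumes that Python's "cp1252" and "cp1253" codecs correspond to the Windows code pages of the same numbers.

```python
# One byte, code point value 202 decimal (0xCA hexadecimal).
raw = bytes([202])

# Assumption: these Python codecs match the Windows code pages 1252 and 1253.
as_western = raw.decode("cp1252")   # interpreted as Western European text
as_greek = raw.decode("cp1253")     # interpreted as Greek text

print(as_western, hex(ord(as_western)))
print(as_greek, hex(ord(as_greek)))
```

The byte on disk is identical in both cases; only the code page chosen when reading it decides which character appears.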

Single Byte Character Set (SBCS):
1250: Windows Latin 2 (Central Europe)
1251: Windows Cyrillic
1252: Windows Latin 1 (ANSI)
1253: Windows Greek
1254: Windows Latin 5 (Turkish)
1255: Windows Hebrew
1256: Windows Arabic
1257: Windows Baltic
1258: Windows Vietnamese
874: Windows Thai

Double Byte Character Set (DBCS):
932: Japanese Shift-JIS
936: Simplified Chinese GBK
949: Korean
950: Traditional Chinese Big5

In each code page, the characters numbered 0 through 127 (0x00 through 0x7F in hexadecimal) are identical and are called the ASCII character set; the 7-bit ASCII characters are included in all of the code pages with the same code point values assigned. As the name implies, a character set is a collection of characters. The ASCII (American Standard Code for Information Interchange) character set standard is a Western character set standard and is the one common denominator contained in all the other common character sets. In other words, the characters in the ASCII character set are the only ones shared across all other common character sets. The ASCII character set is composed of 128 characters (7 bits, or 2 to the 7th power) and contains the English letters, American English punctuation, base-10 numerals and a set of control characters (tab, escape, shift-in, etc.). The printable characters occupy positions 32 through 126 (0x20 through 0x7E). The following lists the printable ASCII characters:

Uppercase Roman: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Lowercase Roman: abcdefghijklmnopqrstuvwxyz
Numerals: 0123456789
Symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
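
The claim that the ASCII range is common to every code page listed earlier can be checked with a small Python sketch. The codec names below are Python's names for those Windows code pages, and treating them as faithful equivalents is an assumption of this example.

```python
# Python codec names assumed to correspond to the Windows code pages above.
CODE_PAGES = [
    "cp1250", "cp1251", "cp1252", "cp1253", "cp1254",
    "cp1255", "cp1256", "cp1257", "cp1258", "cp874",
    "cp932", "cp936", "cp949", "cp950",
]

# Every 7-bit value, 0x00 through 0x7F.
ascii_bytes = bytes(range(0x80))
ascii_text = ascii_bytes.decode("ascii")

# For each code page, check that the 7-bit range decodes exactly like ASCII.
shared = {cp: ascii_bytes.decode(cp) == ascii_text for cp in CODE_PAGES}

# The printable subset runs from space (0x20) through tilde (0x7E).
printable = ascii_text[0x20:0x7F]

for cp, same in shared.items():
    print(cp, "matches ASCII" if same else "differs from ASCII")
print(len(printable), "printable characters")
```

The printable slice also counts the space character, which the list above does not show.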

[Single Byte Character Set]
Languages that require fewer than 256 characters can be represented by an 8-bit (one-byte) character set. Because a bit can hold one of two possible values (a '1' or a '0'), an 8-bit, or single-byte, character set can contain up to 256 (2 to the 8th power) distinct characters. Many languages, including Western European languages such as English, French, German, Italian and Spanish, need no more than 256 characters and therefore use 8-bit character sets. Although a single 8-bit character set cannot represent all the European languages (combined, they require far more than 256 characters), some languages do share a common character set.
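
As a rough illustration of the 256-character limit, the Python sketch below tests whether a piece of text fits in a given single-byte code page. The helper name fits_in_code_page is hypothetical, and the "cp1252" and "cp1250" codecs are assumed to match the Windows code pages of the same numbers.

```python
# An 8-bit character set has room for 2 to the 8th power distinct values.
capacity = 2 ** 8


def fits_in_code_page(text, code_page):
    """Return True if every character of text exists in code_page.

    Hypothetical helper for illustration; it relies on the codec raising
    UnicodeEncodeError for characters the code page does not contain.
    """
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


print(capacity)
print(fits_in_code_page("Straße déjà vu", "cp1252"))  # German and French text
print(fits_in_code_page("łódź", "cp1252"))            # Polish text, Western page
print(fits_in_code_page("łódź", "cp1250"))            # Polish text, Central European page
```

German and French share code page 1252, while the Polish letters need the Central European page; this is the sense in which some languages share a character set and others do not.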

[Extended Characters]
An 8-bit representation can handle 128 more characters than the 7-bit representation and is called an extended character set. Extended character sets use the ASCII characters as their common base and add 128 characters above the lower 128 ASCII positions. The characters numbered 128 through 255 (0x80 through 0xFF in hexadecimal) are called extended characters, or accented characters, and vary from code page to code page.
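
One way to observe that only the extended range varies is to compare two code pages byte by byte. The Python sketch below does this for code pages 1252 and 1251; the function name differing_bytes is hypothetical, and the codecs are assumed to match the Windows code pages.

```python
def differing_bytes(page_a, page_b):
    """List the byte values that decode to different characters.

    Hypothetical helper for illustration. Positions that a code page leaves
    undefined are decoded to the Unicode replacement character.
    """
    diffs = []
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(page_a, errors="replace")
        char_b = raw.decode(page_b, errors="replace")
        if char_a != char_b:
            diffs.append(value)
    return diffs


# Western European (1252) versus Cyrillic (1251).
diffs = differing_bytes("cp1252", "cp1251")

print(len(diffs), "positions differ")
print("lowest differing position:", hex(min(diffs)))
```

Every differing position falls at or above 0x80, which matches the description of the lower half as the shared ASCII base.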

[Double Byte Character Set]
Some languages that use ideographic characters, such as Chinese (Traditional and Simplified), Japanese and Korean, have thousands of characters, far more than 256. Because a single byte is not sufficient to encode all the characters, multi-byte character sets were created for these languages. The term 'double-byte character set' is frequently used for these encodings, but in actuality they are a mixture of single-byte and double-byte characters; hence the term 'multi-byte characters' is often used to describe the characters of the Far Eastern languages. A double-byte (two-byte) value is 16 bits wide and can therefore, in theory, provide up to 65,536 (2 to the 16th power) unique values.
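
The mixture of single-byte and double-byte characters can be seen by measuring how many bytes each character takes. The Python sketch below uses the "cp932" codec, assumed here to match the Japanese Shift-JIS code page 932; the sample string is illustrative only.

```python
# Two bytes give 16 bits, so at most 2 to the 16th power distinct values.
max_double_byte_values = 2 ** 16

# One ASCII letter followed by two Japanese characters.
text = "A日本"

# Assumption: Python's "cp932" codec matches Windows code page 932.
per_char = [(ch, len(ch.encode("cp932"))) for ch in text]
total = len(text.encode("cp932"))

print(max_double_byte_values)
for ch, size in per_char:
    print(ch, size, "byte(s)")
print("total bytes:", total)
```

The ASCII letter stays one byte wide while each Japanese character takes two, so the byte length of a string in such a code page cannot be assumed to equal its character count.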