Don's Home Technology Character Sets, Fonts, Typography
See also typography

Character Sets and Languages

Character Sets (Language Scripts)

You may see the following tags in web pages:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Other common charset values:
  us-ascii & windows-1252 (Windows ANSI; not the same as the IBM Extended Character Set - ECS)
  For the first 128 characters these are the same as iso-8859-1. Other character sets are listed below.
 The first 128 characters (0-127) are control characters and
 the standard letters, numbers, and special characters (punctuation, +, /, =, ...).
 The extended characters (128-255, 80-FF hex) include multinational characters
 and things like cent "¢", pound "£", copyright "©".
 The 8-bit character file here has the printable characters from 32-255.
The extended characters are standardized for web pages, but when viewed in
native text editors they may appear differently in different operating systems,
e.g. Windows, Macintosh OS, and UNIX.
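
As a rough sketch of how software might act on the charset declaration shown above, the Python snippet below looks for a charset=... value near the start of a page's raw bytes and decodes the page with it. The function name sniff_meta_charset, the regular expression, the 1024-byte window and the iso-8859-1 fallback are all assumptions made for illustration; real browsers follow a more elaborate procedure.

```python
import re

# Simplified pattern for charset=... inside a meta tag (an assumption, not a full parser).
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)

def sniff_meta_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    """Return the charset named in the first 1024 bytes, or the default."""
    match = CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default

page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"<p>\xc2\xa9 2007</p>"
)
charset = sniff_meta_charset(page)
text = page.decode(charset)   # the two bytes C2 A9 become the copyright sign
print(charset)
print(text[-13:])
```

Decoding the same bytes as iso-8859-1 instead would show two characters where the copyright sign should be, which is the usual symptom of a wrong charset declaration.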

Character Entity Ref. codes and UTF codes may be used for others.
See also Emojis

HTML also uses something called character entity references to specify characters (e.g. &copy; or &#169; for obtaining the copyright sign ©). The codes (e.g. 169) are the same as the Extended ANSI character codes.
Character entity reference codes or UTF codes may be used for other characters such as Greek letters and mathematical symbols.
More characters are available with Unicode character encodings. U+xxxx codes are Unicode code point values, which are the same whether the page is encoded as UTF-8 or UTF-16. For example the Euro symbol "€" (U+20AC) may be coded as: &#x20AC; or &#8364;, where 8364 is the decimal equivalent of hex 20AC. See The Problem with Em 'n Em and other special characters for a good explanation.

Notation for space:
  Octal  Decimal  Hex
  \040   32       \x20
URLs use hex characters, e.g. %20 for space. Character entity references use the decimal character code, e.g. &#32;
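
The notations above can be checked with a short Python sketch. It uses only the standard library (html.unescape and urllib.parse.quote); the helper name numeric_references is made up for this example.

```python
import html
from urllib.parse import quote

copyright_sign = "\u00a9"   # ©
euro = "\u20ac"             # €

# Named and decimal references resolve to the same character.
print(html.unescape("&copy;") == html.unescape("&#169;") == copyright_sign)

def numeric_references(ch: str) -> tuple:
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"

print(numeric_references(euro))

# The same space character in octal, decimal, hex and URL notation.
space = " "
print(f"\\{ord(space):03o}", ord(space), f"\\x{ord(space):02x}", quote(space))
```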

The most common standard is ANSI X3.4-1968, commonly called US-ASCII or simply ASCII. It is described in RFC1345 and is the same as the first 128 characters of ISO-8859-1 (Latin1).
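
A minimal Python check of that relationship, using the codec names Python happens to ship with: the 128 ASCII codes decode to the same characters under several encodings, while a byte above 127 is not ASCII at all.

```python
ascii_bytes = bytes(range(128))          # every 7-bit code, 0 through 127

as_ascii = ascii_bytes.decode("ascii")
as_latin1 = ascii_bytes.decode("iso-8859-1")
as_cp1252 = ascii_bytes.decode("windows-1252")
as_utf8 = ascii_bytes.decode("utf-8")

# All four decodings agree on the first 128 characters.
print(as_ascii == as_latin1 == as_cp1252 == as_utf8)

# Above 127, US-ASCII defines nothing, while ISO-8859-1 does.
try:
    b"\xa9".decode("ascii")
except UnicodeDecodeError:
    print("0xA9 is not ASCII")
print(b"\xa9".decode("iso-8859-1"))
```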

International Organization for Standardization (ISO) 8859 and Microsoft Codepages are other common standards.

BOLD - Preferred MIME Name
Table at Encoding.CodePage Property msdn.microsoft.com
Name          CodePage  BodyName     HeaderName    WebName       Encoding.EncodingName
shift_jis     932       iso-2022-jp  iso-2022-jp   shift_jis     Japanese (Shift-JIS)
windows-1250  1250      iso-8859-2   windows-1250  windows-1250  Central European [1] Latin-2
windows-1251  1251      koi8-r       windows-1251  windows-1251  Cyrillic
Windows-1252  1252      iso-8859-1   Windows-1252  Windows-1252  Western European Latin-1
windows-1253  1253      iso-8859-7   windows-1253  windows-1253  Greek
windows-1254  1254      iso-8859-9   windows-1254  windows-1254  Turkish
csISO2022JP   50221     iso-2022-jp  iso-2022-jp   csISO2022JP   Japanese (JIS-Allow 1 byte Kana)
iso-2022-kr   50225     iso-2022-kr  euc-kr        iso-2022-kr   Korean (ISO)
[1] Latin-2 Central or East European (Czech, Hungarian, Polish and Slovak)
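
To see why the code page matters, the Python sketch below decodes one and the same byte under five of the Windows code pages from the table. The byte value 0xE0 is an arbitrary choice for illustration, and the codec names are the aliases Python accepts, not necessarily the names other software uses.

```python
raw = b"\xe0"   # one byte with the high bit set

code_pages = ["windows-1250", "windows-1251", "windows-1252",
              "windows-1253", "windows-1254"]

# The same byte maps to a different character depending on the code page.
decoded = {name: raw.decode(name) for name in code_pages}
for name, ch in decoded.items():
    print(f"{name}: U+{ord(ch):04X} {ch}")
```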

US-ASCII (U.S. national variant of ISO/IEC 646. Formally, the U.S. standard ANSI X3.4.)
    ISO             Windows (nearest code page; not always byte-for-byte identical)
ISO-8859-1 - 1252 - Latin-1 Western European (Default) ANSI
ISO-8859-2 - 1250 - Latin-2 East European (Czech, Hungarian, Polish and Slovak)
ISO-8859-3 - Latin-3 South European
ISO-8859-4 - 1257 - Latin-4 North European (Baltic)
ISO-8859-10 - Latin-6 Nordic, replaces Latin-4 (Sámi, Inuit, Icelandic)
ISO-8859-5 - 1251 - Cyrillic (Azerbaijani, Bulgarian, Buryat, Byelorussian, Karakalpak,
             Kazakh, Khalkha, Kirghiz, Macedonian, Moldavian, Russian, Serbian,
             Tajik, Turkmen, Ukrainian and Uzbek languages)
ISO-8859-6 - 1256 - Arabic (Arabic, Farsi [Iran], Jawi, Kurdish, Pashto [Afghanistan],
                   Persian, Sindhi and Urdu [Pakistan], Panjabi)
ISO-8859-7 - 1253 - Greek
ISO-8859-8 - 1255 - Hebrew
ISO-8859-9 - 1254 - Latin-5 Turkish 
                    1258 - Viet Nam 
ISO-8859-11 - 874 - Thai 
ISO-2022-KR - 949 - Korean Extended Wansung 
936 - PRC GBK (XGB) 
950 - Chinese (Taiwan, Hong Kong) 
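
The ISO and Windows columns above pair up related sets, but the pairs are not interchangeable. A small Python sketch of two known differences (the sample byte and letter are chosen only for illustration):

```python
# Byte 0x80 is a printable character in windows-1252 but a control code in ISO-8859-1.
print(repr(b"\x80".decode("windows-1252")))
print(repr(b"\x80".decode("iso-8859-1")))

# The Cyrillic pair places the same letter at different byte values.
letter = "\u042f"   # CYRILLIC CAPITAL LETTER YA
iso_byte = letter.encode("iso-8859-5")
win_byte = letter.encode("windows-1251")
print(iso_byte.hex(), win_byte.hex())
```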

 Multi-byte codes (one or more bytes per character)
 UTF-8  - Unicode (ISO/IEC 10646), 1 to 4 bytes per character
 Big5   - Chinese Traditional (Taiwan, HongKong)
 EUC-TW - Chinese Traditional
 GB2312  - Chinese Simplified (China mainland, Singapore and Malaysia)
 GB18030 is the registered Internet name for the official character set of
                the People's Republic of China (PRC) superseding GB 2312.
 GB - (GuoBiao) Chinese Simplified
 GBK - Chinese Simplified
 HZ - Chinese Simplified
 ISO-2022-GB - an emerging new international Internet standard for encoding Chinese text

 EUC-JP  Japanese
 EUC-JIS  Japanese
 Shift_JIS  Japanese
 ISO-2022-JP  Japanese
 ISO-2022-JP2  Japanese
 KSC5601 - Korean
 EUC-KR - Korean
 KOI8-R - Cyrillic (Russian; a single-byte code)
 KOI8-U - Cyrillic (Ukrainian; a single-byte code)
 Devanagari (Bhojpuri, Bihari, Hindi, Kashmiri, Konkani, Marathi, Nepali and Sanskrit.
            It is also used for writing Panjabi by Indians who are not Sikhs)
Gujarati  (Indian State of Gujarat)
Gurmukhi (Panjabi [Pakistan and India].
           Panjabi can also be written with Devanagari and Arabic)
Bengali  (Bengali [Bangladesh])

Thai
Vietnamese
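
As an illustration of the multi-byte codes listed above, this Python sketch encodes a single Chinese character in several of them and prints the resulting bytes. The sample character (U+4E2D) and the selection of codecs are choices made for this example; the codec names are those Python's standard library accepts.

```python
ch = "\u4e2d"   # a common CJK ideograph, present in all the sets below

encodings = ["utf-8", "big5", "gb2312", "gbk", "gb18030",
             "euc-jp", "shift_jis", "iso-2022-jp", "euc-kr"]

# Each code assigns its own byte sequence to the same character.
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:12} {len(data)} bytes  {data.hex()}")
```

ISO-2022-JP comes out longer than the others because it wraps the character in escape sequences that switch character sets.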


 (Character sets recognized in Netscape 1.1 were us-ascii, iso-8859-1, iso-2022-jp,
 x-sjis, x-euc-jp, x-mac-roman)
 Central European
Unicode:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646.

There are 8-, 16- and 32-bit Unicode transformation formats (UTF), plus a few special-purpose ones:

  • UTF-7 - For 7-bit environments
  • UTF-8 is the byte-oriented encoding form of Unicode. It uses anywhere between one and four bytes to encode a character.
  • UTF-16 uses two bytes for the characters inside the basic multilingual plane (BMP). UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. Characters outside the BMP require surrogate pairs (four bytes) in UTF-16.
  • UTF-32 Uses four bytes for all characters, so you don't have to worry about characters outside the BMP.
  • UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty.
  • GB18030 (Simplified Chinese) can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set. In other words, it is a Chinese equivalent of UTF-8.
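
The byte counts described in the list above can be checked with a short Python sketch. The three sample characters and the helper name encoded_sizes are chosen only for illustration; the -be codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each UTF."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# An ASCII letter, the Euro sign (inside the BMP), an emoji (outside the BMP).
samples = ["A", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Outside the BMP, UTF-16 uses a surrogate pair: two 16-bit units.
print("\U0001F600".encode("utf-16-be").hex())
```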

UTF-16 and UTF-32 are not byte oriented and so a byte order must be selected when transmitting them over a byte oriented network or storing them in a byte oriented file.

Some systems store data with most significant byte (MSB) first (big-endian) and others with it last (little-endian). A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.
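
A minimal sketch of BOM handling in Python follows. The BOM constants come from the standard codecs module; the function name byte_order_from_bom and the "unknown" return value are assumptions for this example, and a file without a BOM simply cannot be classified this way.

```python
import codecs

def byte_order_from_bom(data: bytes) -> str:
    """Guess the encoding form from a leading byte order mark, if any."""
    # UTF-32 marks are checked first: the UTF-32-LE mark starts with the UTF-16-LE mark.
    signatures = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in signatures:
        if data.startswith(bom):
            return name
    return "unknown"

print("\ufeff".encode("utf-16-be").hex())   # U+FEFF, most significant byte first
print("\ufeff".encode("utf-16-le").hex())   # U+FEFF, least significant byte first
print(byte_order_from_bom("\ufeffhello".encode("utf-16-le")))
```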

See: unicode.org, Alan Wood's Unicode Page, Reference.com (Table of Unicode characters, 128 to 999), Table of Unicode characters from 1 to 65535 at unicode.coeurlumiere.com

Some of the scripts and character blocks in the Unicode Character Database (UCD). See a larger list at Alan Wood's Unicode and Multilingual Support in HTML
Armenian
Bengali
Block Elements
Bopomofo
Bopomofo Extended
Box Drawing
Braille Patterns
Canadian Aboriginal Syllabics
Cherokee
CJK Chinese/Japanese/Korean
Dingbats
Ethiopic
Georgian
Hangul Compatibility Jamo
Hangul Jamo
Hangul Syllables
Hebrew
High Surrogates
Hiragana
Ideographic Description Characters
IPA Extensions
Kanbun
KangXi Radicals
Kannada
Katakana
Khmer
Lao
Malayalam
Mathematical Operators
Miscellaneous Symbols
Miscellaneous Technical
Mongolian
Myanmar
Number Forms
Ogham
Oriya
Runic
Sinhala
Syriac
Tamil
Telugu
Thaana
Thai
Tibetan
Unified Canadian Aboriginal Syllabics
Yi Radicals
Yi Syllables
CJK (Chinese/Japanese/Korean) characters come in several Unicode subsets: CJK Compatibility, CJK Unified Ideographs (Extension A & B), CJK Compatibility Ideographs, CJK Compatibility Forms, CJK Symbols and Punctuation, CJK Radicals Supplement, CJK Compatibility Ideographs Supplement, CJK Miscellaneous
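
Part of the Unicode Character Database is available from Python's standard unicodedata module. The sketch below prints the official character name for a few code points picked from the scripts listed above; the name usually begins with the script name. Note that this module does not report block names, so this is only a rough way to see where a character belongs.

```python
import unicodedata

# One sample code point each: Armenian, Hiragana, Katakana, Cherokee, Miscellaneous Symbols.
for ch in ["\u0531", "\u3042", "\u30a2", "\u13a0", "\u2603"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```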

ISO/ANSI vs ECS characters

Characters 33-126 (letters, numbers and special characters, i.e. the standard keyboard characters) are the same for ANSI and ECS; however, the other characters are not the same. E.g. the British Pound character is 156 in ECS and 163 in ANSI.
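
The Pound example can be reproduced in Python, assuming that ECS here means the IBM PC code page 437 (Python codec cp437) and that ANSI means ISO-8859-1; both mappings are assumptions about what the terms refer to.

```python
pound = "\u00a3"   # £

ecs_code = pound.encode("cp437")[0]         # IBM PC extended set (code page 437 assumed)
ansi_code = pound.encode("iso-8859-1")[0]   # ISO/ANSI Latin-1

print(ecs_code, ansi_code)

# The standard keyboard characters (33-126) encode identically in both sets.
printable = "".join(chr(i) for i in range(33, 127))
print(printable.encode("cp437") == printable.encode("iso-8859-1"))
```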

Microsoft

Microsoft has defined the Windows Glyph List 4 (WGL4) standard, which incorporates codepages 1250 (Eastern Europe), 1251 (Cyrillic), 1252 (US English = ANSI), 1253 (Greek) and 1254 (Turkish).

Language:

<META HTTP-EQUIV="Content-Language" CONTENT="zh">
<HTML LANG="fr">
<BLOCKQUOTE LANG="fr">
<P LANG="fr">

Common values: (See: ISO639A List)
ar - Arabic, de - German, en - English, es - Spanish, fr - French, ga - Irish,
gu - Gujarati, he - Hebrew, hi - Hindi, it - Italian, ja - Japanese, ko - Korean,
pa - Punjabi, yi - Yiddish, zh - Chinese

They may also contain sub-tags e.g.:
fr-CA - French Canadian
ar-EG - Egyptian Arabic
en-US - American English
zh-TW - Taiwanese Chinese
zh-Hant - Traditional Chinese
zh-Hans - Simplified Chinese
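
A small Python sketch of how such tags break down into sub-tags. The function name split_language_tag is hypothetical, and the rules are deliberately simplified: it only recognizes a two-letter region and a four-letter script, which covers the examples above but not every valid tag.

```python
def split_language_tag(tag: str) -> dict:
    """Split a simple tag into language, script and region sub-tags."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            result["script"] = part.title()      # e.g. Hant, Hans
        elif len(part) == 2 and part.isalpha():
            result["region"] = part.upper()      # e.g. CA, TW
    return result

for tag in ["fr-CA", "zh-TW", "zh-Hant", "zh"]:
    print(tag, split_language_tag(tag))
```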

Fonts

English: Times *, Times New Roman †, Helvetica *, Arial †, Courier *
* - Standard on Macintosh and UNIX, † - Standard on Windows
Chinese Traditional: MingLiU (IE5), PMingLiU (office 2000)
Chinese Simplified: MS Song  (IE5), MS Hei  (IE5), SimSun (Office 2000)
Japanese:  MS Gothic, MS Mincho
Korean: Gulimche
Hebrew: Web Hebrew AD and Web Hebrew Monospace
Others: (See Alan Wood's list)
FZNew XiuLi-Z11
Andale Mono
Angsana New
Apple Chancery
Arabic Newspaper
Arabic Transparent
Arial
Arial Black
Arial GEO
Arial Narrow
Arial Unicode MS
Athena Roman
Ballymun RO
Batang
Bitstream CyberBase
Bitstream CyberBit
Bitstream CyberCJK
Book Antiqua
Bookman Old Style
Capitals
Caslon (Unix)
Caslon (Windows)
Century Gothic
Century Schoolbook
Charcoal
Chicago
CJK - Chinese/Japanese/Korean
ClearlyU
ClearlyU Arabic
Code2000
Comic Sans MS
Cordia New
Courier Mono Thai
Courier New
Courier New GEO
David
David Transparent
Ethiopia Jiret
Fixed Miriam Transparent
Franklin Gothic Book
Franklin Gothic Demi
Franklin Gothic Demi Cond
Franklin Gothic Heavy
Franklin Gothic Medium
Franklin Gothic Medium Cond
Gadget
Garamond
Geneva
Georgia
Georgia Greek
GF Zemen Unicode
Gulim Che
Haettenschweiler
Helvetica
Hoefler Text
Impact
Iris UPC
Lucida Console
Lucida Sans
Lucida Sans Typewriter
Lucida Sans Unicode
MgOldTimes UC Pol Normal
MingLiU
Miriam
Miriam Fixed
Miriam Transparent
Monaco
Monotype Corsiva
MS Gothic
MS Hei
MS Mincho
MS Song
Nesf
New York
NunacomU
Palatino
Palatino Linotype
PMingLiU
Rod
Sand
Sibal Devanagari
SIL Yi
SImPL
SimSun
Skia
TabAvarangal2
Tahoma (Macintosh)
Tahoma (Windows)
Techno
TektonPro
Textile
Thryomanes
Times
Times New Roman
Times New Roman GEO
TITUS Cyberbit Basic
Traditional Arabic
Verdana
Vusillus Old Face

See Also:
Fonts
Character set list at IANA, ASCII - ISO 8859-1 (Latin-1) Table with HTML Entity Names, Examples of Characters, Keystrokes and Glyphs, Unicode Standard, The Multilingual World Wide Web, Using National and Special Characters in HTML, Alan Wood's Demonstrations of Special Characters, HTML and JavaScript
Microsoft: Character sets
Encoding.CodePage Property (System.Text)
INFO: XML Encoding and DOM Interface Methods
ISO 8859 Alphabet Soup

Terms

ANSI - American National Standards Institute
Big5 - Character Set used for Traditional Chinese Characters
CJK - Chinese/Japanese/Korean
ECS - IBM Extended Character Set
EUC - Extended UNIX Code
GB (GuoBiao) - Character Set used for Simplified Chinese Characters
ISO - International Organization for Standardization
UCD - Unicode Character Database
UTC - Unicode Technical Committee
UTF - Unicode transformation format


last updated 7 Feb 2007