
Unicode Covers All Major Living Languages

Unicode is a method of encoding characters in computers. UTF stands for Unicode Transformation Format.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646.

There are 8-, 16- and 32-bit Unicode transformation formats (UTF), plus several special-purpose encodings:

  • UTF-7 is intended for 7-bit environments.
  • UTF-8 is the byte-oriented encoding form of Unicode. It uses between one and four bytes to encode a character.
  • UTF-16 uses two bytes for the characters inside the Basic Multilingual Plane (BMP), U+0000 to U+FFFF. UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width.
    Characters outside the BMP require surrogate pairs in UTF-16: two 16-bit code units that together carry 20 bits, enough to address the 16 supplementary planes.
    Unicode is divided into a total of 17 code areas (planes), each with 65,536 code points (16 bits):
      Plane 0 is the BMP, U+0000 to U+FFFF
      Plane 1 is the Supplementary Multilingual Plane (SMP), U+10000 to U+1FFFF
  • UTF-32 uses four bytes for all characters, so you don't have to worry about characters outside the BMP.
  • UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty.
  • GB18030 (Simplified Chinese) can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set. In other words, it plays a role for Chinese legacy encodings similar to the one UTF-8 plays for ASCII.
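As a rough illustration of the size differences in the list above, the following Python sketch encodes a few sample characters and counts the bytes each encoding form needs. The helper name `encoded_lengths` and the choice of sample characters are illustrative assumptions; the codec names are the ones shipped with Python's standard library.

```python
# Compare how many bytes each encoding form needs for a few sample characters.
# The sample characters are arbitrary: A, Greek Pi, a CJK ideograph, an emoji.
samples = ["A", "\u03a0", "\u4e2d", "\U0001F600"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    # The "-be" variants are used so that no byte order mark is prepended.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")
```

Only the last sample lies outside the BMP, so it is the only one that needs four bytes (a surrogate pair) in UTF-16.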

    UTF-16 and UTF-32 are not byte oriented and so a byte order must be selected when transmitting them over a byte oriented network or storing them in a byte oriented file.

    Some systems store data with most significant byte (MSB) first (big-endian) and others with it last (little-endian). A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.
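The byte order issue can be seen directly in a short Python sketch. It encodes one character in both byte orders and then inspects a leading BOM. The function name `sniff_utf16_order` is a hypothetical helper, and it only looks at UTF-16 BOMs; a real detector would also need to rule out UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs

text = "\u03a0"  # Greek capital letter Pi, U+03A0

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
print(big.hex(), little.hex())     # 03a0 a003

def sniff_utf16_order(data):
    """Guess the byte order of UTF-16 data from a leading BOM (U+FEFF), if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown (no BOM)"

print(sniff_utf16_order(codecs.BOM_UTF16_BE + big))
print(sniff_utf16_order(codecs.BOM_UTF16_LE + little))
print(sniff_utf16_order(big))
```

Python's generic "utf-16" codec applies the same idea when decoding: it reads the BOM, picks the byte order, and drops the BOM from the result.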

    See: unicode.org, Alan Wood's Unicode Page, Reference.com (Table of Unicode characters, 128 to 999), and the table of Unicode characters from 1 to 65535 at unicode.coeurlumiere.com

    Some of the scripts and character blocks in the Unicode Character Database (UCD). See a larger list at Alan Wood's Unicode and Multilingual Support in HTML.
    Armenian
    Bengali
    Block Elements
    Bopomofo
    Bopomofo Extended
    Box Drawing
    Braille Patterns
    Canadian Aboriginal Syllabics
    Cherokee
    CJK Chinese/Japanese/Korean
    Dingbats
    Ethiopic
    Georgian
    Hangul Compatibility Jamo
    Hangul Jamo
    Hangul Syllables
    Hebrew
    High Surrogates
    Hiragana
    Ideographic Description Characters
    IPA Extensions
    Kanbun
    KangXi Radicals
    Kannada
    Katakana
    Khmer
    Lao
    Malayalam
    Mathematical Operators
    Miscellaneous Symbols
    Miscellaneous Technical
    Mongolian
    Myanmar
    Number Forms
    Ogham
    Oriya
    Runic
    Sinhala
    Syriac
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Unified Canadian Aboriginal Syllabics
    Yi Radicals
    Yi Syllables
    CJK (Chinese/Japanese/Korean) characters come in several Unicode subsets: CJK Compatibility, CJK Unified Ideographs (including Extensions A and B), CJK Compatibility Ideographs, CJK Compatibility Forms, CJK Symbols and Punctuation, CJK Radicals Supplement, CJK Compatibility Ideographs Supplement, CJK Miscellaneous.

    ISO/ANSI vs ECS characters

    Characters 33-126 (letters, numbers and special characters, i.e. the standard keyboard characters) are the same for ANSI and ECS; the other characters are not. E.g., the British pound character (£) is 156 in ECS and 163 in ANSI.
    UTF-8 is a variable-width encoding built from 8-bit code units. A file that contains only ASCII characters (the roman letters in upper and lower case, numbers, punctuation and control characters) is byte-for-byte identical in ASCII and UTF-8.
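The pound sign example can be checked with a small Python sketch. It assumes that "ECS" refers to the IBM PC extended character set (code page 437) and "ANSI" to Windows-1252; both mappings are assumptions on my part, since the text does not name the code pages. The last lines also check that plain ASCII text produces the same bytes in ASCII and UTF-8.

```python
pound = "\u00a3"  # POUND SIGN

# Assumption: "ECS" = code page 437, "ANSI" = Windows-1252.
ecs_value = pound.encode("cp437")[0]
ansi_value = pound.encode("cp1252")[0]
print(ecs_value, ansi_value)  # 156 163

# Printable keyboard characters 33-126 map to the same byte in both code pages.
keyboard = "".join(chr(n) for n in range(33, 127))
same_in_both = keyboard.encode("cp437") == keyboard.encode("cp1252")
print(same_in_both)  # True

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "Hello, world!"
ascii_matches_utf8 = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(ascii_matches_utf8)  # True

# Outside ASCII, UTF-8 needs more than one byte per character.
print(len(pound.encode("utf-8")))  # 2
```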

    Unicode was originally designed as a 16-bit character set (65,536 values) to cover all the world's major living languages, in addition to scientific symbols and dead languages that are the subject of scholarly interest. Its code space has since grown to 17 planes, U+0000 to U+10FFFF. It also includes emojis. It eliminates the complexity of the multibyte character sets that were used on UNIX and Windows to support Asian languages. Unicode was created by a consortium of companies including Apple, Microsoft, HP, Digital and IBM, and merged its efforts with the ISO 10646 standard to produce a single standard in 1993. Unicode has been the basis of operating systems since Windows NT.

    Code points are written as U+xxxx, where each x is a hexadecimal digit 0-F and F represents decimal 15.
    E.g., the Greek letter Π (Pi) is Unicode U+03A0 and is coded as &#928; in web browsers and other places, where 928 is the decimal equivalent of hex 03A0.
    HTML browsers also support character entity names like &Pi; for Π.
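The relationship between the U+ notation, the decimal value and the HTML references can be sketched in Python as follows. The variable names are illustrative; `html.unescape` is the standard library function that resolves numeric and named character references.

```python
import html

code_point = 0x03A0            # hexadecimal 03A0
ch = chr(code_point)           # the character itself, Greek capital Pi

u_notation = f"U+{code_point:04X}"     # conventional U+xxxx form
decimal_ref = f"&#{code_point};"       # decimal numeric character reference
hex_ref = f"&#x{code_point:X};"        # hexadecimal numeric character reference

print(u_notation, code_point, decimal_ref, hex_ref)
# U+03A0 928 &#928; &#x3A0;

# All three HTML forms resolve to the same character.
for ref in (decimal_ref, hex_ref, "&Pi;"):
    print(ref, "->", html.unescape(ref))
```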

    A generic white smiling face code is U+263A ☺.
    However, the BMP ran out of room for pictographs like emojis, so newer ones are in the Supplementary Multilingual Plane, U+10000 to U+1FFFF.
    So a colored grinning face 😀 is U+1F600, which can be coded as &#128512; or &#x1F600;.
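Because U+1F600 lies outside the BMP, UTF-16 has to represent it as a surrogate pair. The sketch below shows the standard arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogate_pair` is a hypothetical helper, not a library call.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

grinning = 0x1F600
high, low = to_surrogate_pair(grinning)
print(f"{high:04X} {low:04X}")             # D83D DE00
print("plane:", grinning >> 16)            # plane: 1
print("decimal:", grinning)                # decimal: 128512

# Cross-check against Python's own UTF-16 encoder.
print(chr(grinning).encode("utf-16-be").hex())  # d83dde00
```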

    Unicode was originally a 16-bit character set in which all characters occupied the same space; today only UTF-32 is fixed width. The first 256 values are the same as the ISO Latin-1 character set, which is also the basis for the ANSI character set used in Windows 3.1 and Windows 95. But Unicode goes on to define many more: early versions of the standard already had 34,168 distinct coded characters. In most character sets a single value is often assigned to several characters. For example, in ASCII a "-" is used to represent a hyphen, a minus sign, a dash and a non-breaking hyphen. In Unicode each meaning is given its own code. The Unicode standard contains only one instance of each character and assigns it a unique name and code value. It also supports "combining" accent characters, which follow the base character that they are to modify.
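Two of the points above, unique character names and combining accents, can be explored with Python's standard `unicodedata` module. This is only a sketch; the particular characters are chosen for illustration.

```python
import unicodedata

# The ASCII "-" and its more specific Unicode relatives each have their own name.
for ch in ("-", "\u2010", "\u2011", "\u2013", "\u2212"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+002D HYPHEN-MINUS
# U+2010 HYPHEN
# U+2011 NON-BREAKING HYPHEN
# U+2013 EN DASH
# U+2212 MINUS SIGN

# A combining accent follows the base character it modifies.
composed = "\u00e9"       # e with acute accent as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

The two spellings look the same on screen but compare unequal as strings, which is why text is usually normalized to one form before it is compared or searched.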

    See Also: UTF-8 Encoding | FileFormat.info
    Robelle
    Unicode - Unihan for Chinese, Japanese, Korean ...
    How to display Chinese (CJK) in Unicode