Unicode Covers All Major Living Languages
Unicode is a method of encoding characters in computers.
UTF stands for Unicode Transformation Format.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
There are 8-, 16- and 32-bit Unicode transformation formats (UTF), along with variants for special environments:
- UTF-7 - For 7-bit environments
- UTF-8 is the byte-oriented encoding form of Unicode. It uses anywhere between one and four bytes to encode a character.
- UTF-16 uses two bytes for the characters inside the Basic Multilingual Plane (BMP), U+0000 to U+FFFF. UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width.
Unicode is divided into a total of 17 code areas (planes), each with 65,536 code points (16 bits). Characters outside the BMP require surrogate pairs in UTF-16.
In a surrogate pair, 4 bits select one of the 16 supplementary planes and 16 bits select the code point within it, for a total of 20 bits.
Plane 0 is the BMP.
Plane 1 is the Supplementary Multilingual Plane (SMP), U+10000 to U+1FFFF.
- UTF-32 uses four bytes for all characters, so you don't have to worry about characters outside the BMP.
- UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty.
- GB18030 (Simplified Chinese) can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set. In other words, it is roughly a Chinese equivalent of UTF-8.
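The byte counts and the surrogate-pair arithmetic described in the list above can be checked with a short Python sketch. The helper names utf16_surrogates and encoded_sizes are made up for this illustration; only str.encode and the codec names come from Python itself.

```python
from __future__ import annotations


def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # 20-bit value: 4 plane bits + 16 bits within the plane
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


def encoded_sizes(ch: str) -> dict[str, int]:
    """Number of bytes one character occupies in each encoding form (no BOM)."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    for ch in ("A", "Π", "😀"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
    high, low = utf16_surrogates(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
```

The explicit "-be" codec names are used so that Python does not prepend a byte order mark, which would inflate the counts.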
UTF-16 and UTF-32 are not byte-oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file.
Some systems store data with most significant byte (MSB) first (big-endian) and others with it last (little-endian).
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream,
where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.
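As a rough illustration of byte order and the BOM, the Python sketch below encodes the same character both ways and inspects the signature. The function detect_utf16_order is a hypothetical helper written for this example, and it assumes the data is already known to be UTF-16 (a UTF-32 little-endian BOM starts with the same two bytes).

```python
import codecs

text = "Π"  # U+03A0

# The same code point, serialized in each byte order (no BOM added).
big = text.encode("utf-16-be")       # most significant byte first
little = text.encode("utf-16-le")    # least significant byte first


def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading U+FEFF signature."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"


if __name__ == "__main__":
    print(big.hex(), little.hex())
    marked = codecs.BOM_UTF16_LE + little
    print(detect_utf16_order(marked))
    # The generic "utf-16" codec reads the BOM and strips it while decoding.
    print(marked.decode("utf-16"))
```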
See: unicode.org, Alan Wood's Unicode Page
Reference.com (Table of Unicode characters, 128 to 999)
Table of Unicode characters from 1 to 65535 at unicode.coeurlumiere.com
Some of the languages in the Unicode Character Database (UCD). See a larger list at Alan Wood's Unicode and Multilingual Support in HTML
CJK Chinese/Japanese/Korean come in several Unicode subsets:
CJK Compatibility, CJK Unified Ideographs (Extension A & B), CJK Compatibility Ideographs, CJK Compatibility Forms, CJK Symbols and Punctuation, CJK Radicals Supplement, CJK Compatibility Ideographs Supplement, CJK Miscellaneous
Canadian Aboriginal Syllabics
Hangul Compatibility Jamo
Ideographic Description Characters
ISO/ANSI vs ECS characters
Characters 33-126 (letters, numbers and special characters, i.e. the standard keyboard characters)
are the same for ANSI and ECS; however, the other characters are not the same. E.g. the
British pound character (£) is 156 in ECS and 163 in ANSI.
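The pound example can be reproduced in Python under one assumption that the text does not state: that "ECS" corresponds to the IBM PC code page 437 and "ANSI" to Windows-1252. With those codecs the numbers match the ones quoted above.

```python
pound = "£"  # U+00A3

# Assumption: "ECS" = code page 437, "ANSI" = Windows-1252.
ecs_value = pound.encode("cp437")[0]
ansi_value = pound.encode("cp1252")[0]

# Bytes 33-126 decode to the same characters in both code pages.
shared = all(
    bytes([b]).decode("cp437") == bytes([b]).decode("cp1252")
    for b in range(33, 127)
)

if __name__ == "__main__":
    print(ecs_value, ansi_value, shared)
```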
UTF-8 uses 8-bit code units (256 byte values). A UTF-8 file that contains only ASCII characters is identical to an ASCII file, which represents the roman letters (upper and lower case), numbers, punctuation and control characters.
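A minimal Python check of the relationship between ASCII and UTF-8: text made up only of ASCII characters produces the same bytes in both encodings, while a character outside ASCII takes more than one byte in UTF-8. The sample strings are arbitrary.

```python
text = "Hello, world!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# A character outside ASCII needs a multi-byte sequence in UTF-8.
pound_utf8 = "£".encode("utf-8")

if __name__ == "__main__":
    print(ascii_bytes == utf8_bytes)
    print(pound_utf8.hex())
```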
Unicode was originally a 16-bit character set (65,536 values) designed to cover all the world's major living languages, in addition to scientific symbols and dead languages
that are the subject of scholarly interest. It has since grown beyond 16 bits and also includes emojis. It eliminates the complexity of multibyte character sets that are used on UNIX and
Windows to support Asian languages. Unicode was created by a consortium of companies including Apple, Microsoft, HP, Digital and IBM,
and merged its efforts with the ISO 10646 standard to produce a single standard in 1993. Unicode is the basis for at least one
operating system: Windows NT.
Code points are written as U+xxxx, where each x is a hexadecimal digit 0-F (F represents decimal 15).
E.g. the Greek letter Π (Pi) is Unicode U+03A0 and is coded as &#928; in web browsers and other places, where 928 is the decimal equivalent of hex 03A0.
HTML browsers also support character entity names like &Pi; for Π.
The generic white smiling face ☺ is U+263A, coded as &#x263A;.
However, the BMP ran out of codes for pictographs like emojis, so newer ones are in the Supplementary Multilingual Plane, U+10000 to U+1FFFF.
So a colored grinning face 😀 is U+1F600 and can be coded as &#x1F600; or &#128512;.
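The conversion between a character, its U+ notation and its HTML numeric references can be sketched in Python. The function numeric_references is a hypothetical helper for this example; html.unescape is the standard-library routine that resolves both numeric and named references.

```python
from __future__ import annotations

import html


def numeric_references(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal HTML references for one character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


if __name__ == "__main__":
    for ch in ("Π", "☺", "😀"):
        dec_ref, hex_ref = numeric_references(ch)
        print(f"U+{ord(ch):04X}", dec_ref, hex_ref)
    # Going the other way: references back to characters.
    print(html.unescape("&#928; &Pi; &#x1F600;"))
```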
Unicode began as a 16-bit character set where all characters occupy the same space. The first 256 values are the same as the ISO Latin-1 character
set, which is also the basis for the ANSI character set used in Windows 3.1 and Windows 95. But Unicode goes on to define many more: an early version of the standard defined 34,168
distinct coded characters, and later versions have added to that. In most character sets a single value is often assigned to several characters. For example, in ASCII a "-" is used to
represent a hyphen, a minus sign, a dash and a non-breaking hyphen. In Unicode each meaning is given its own code. The Unicode standard
contains only one instance of each character and assigns it a unique name and code value. It also supports "combining" accent characters,
which follow the base character that they are to modify.
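Both points, distinct codes for the meanings of "-" and combining accents that follow their base character, can be seen with Python's unicodedata module. This is only a sketch; the particular characters chosen are examples, not a complete list.

```python
import unicodedata

# Characters that plain ASCII folds into the single "-" key.
dash_like = ["-", "\u2010", "\u2011", "\u2212", "\u2014"]
dash_names = [unicodedata.name(ch) for ch in dash_like]

# A base letter followed by a combining accent (U+0301 COMBINING ACUTE ACCENT).
decomposed = "e\u0301"
# NFC normalization merges the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    for ch, name in zip(dash_like, dash_names):
        print(f"U+{ord(ch):04X} {name}")
    print(len(decomposed), len(composed))
```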