\title{Unicode}

Those of us who have been working with text for some time are familiar with the many different ways in which accents, other diacritics and non-alphanumeric symbols are coded, not to mention non-Latin texts like Arabic, Kanji, and all the other languages in which the staff at the School of Oriental and African Studies manage to produce setting. Of course, we all use \ASCII\ (or \EBCDIC, which is not very different), unless you happen to use Locoscript, but there is no standard way of representing characters above \ASCII\ 127. The DOS extended (`high') \ASCII\ characters have become a sort of standard on PCs, but certainly cannot be assumed in any text file.

Unicode is a new character coding system based on \ASCII. It has been produced by a consortium, Unicode Inc., which includes Apple, Xerox, IBM, Microsoft, Sun, Novell, Aldus, and NeXT, and is a 16-bit system. It therefore allows 65,536 characters to be coded. This number is still not enough to include all the Chinese, Japanese and Korean traditional characters (alphabets is the wrong word here), which together add up to about 125,000 symbols, although eliminating duplicates in the different languages reduces this to about 36,000 characters in common use. Even this is considered too many when taken in conjunction with the other symbols required.

Unicode therefore uses a process called `unification', so that each character is given a single code: irrespective of what language the character is in, what it means, or how it is pronounced, it will have the same code. This is similar to the way that \ASCII\ codes letters without any reference to how they are pronounced in different languages, except that in the Eastern languages the `Han' characters represent words rather than letters. After unification there are about 18,000 characters, and the total `character set' now stands at about 25,000, which leaves plenty of room for `non-unified' Han characters and other alphabets (have they included Amharic, for example?). Incidentally, the first 128 characters correspond to \ASCII. This is a relief and seems obvious, but how often has the obvious not been what is produced? Nonetheless, a translation table or program will be necessary to go from 16-bit to 7- or 8-bit coding, or vice versa (a rough sketch of such a translation appears below).

Most of the vendors involved in the Unicode project plan to produce systems which incorporate Unicode. Unfortunately, however, the situation is not that simple: there is another system, on which the International Standards Organisation (ISO) has been working. The standards committee has produced a Draft International Standard (DIS 10646), which takes the opposite approach to Unicode, retaining the national character sets. The coding here is 32-bit: the first eight bits indicate the character set, with the remaining 24 indicating the character. $2^{24}$ is nearly 17 million, so even the Chinese character set can be included with ease. This format is intended to maintain compatibility with existing standards.

Which system will be adopted? The final aims of the two groups are ultimately the same, but their intermediate aims differ: machine-independent coding (Unicode) and compatibility with existing standards (ISO). It is interesting to note that the Japanese national standards group has voted against DIS 10646 because that standard rejects Han unification, which they feel is so vital that they have developed their own unifying standards.
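By way of illustration, the short C sketch below shows the kind of translation program mentioned above: it passes 16-bit Unicode values through to 7-bit \ASCII\ where the two codings coincide, substituting a fallback character otherwise, and it also splits a 32-bit DIS 10646 code into the 8-bit character-set selector and 24-bit character number just described. The function names, the fallback character and the sample values are illustrative assumptions only, not part of either standard.

\begin{verbatim}
#include <stdio.h>
#include <stdint.h>

/* Translate a 16-bit Unicode value to 7-bit ASCII where possible.
 * The first 128 Unicode codes coincide with ASCII, so they pass
 * straight through; everything else is replaced by a fallback
 * character here.  A real translation program would use a much
 * larger look-up table. */
static int unicode_to_ascii(uint16_t code)
{
    if (code < 128)
        return (int) code;   /* identical in both codings */
    return '?';              /* no 7-bit equivalent: substitute */
}

/* Split a 32-bit DIS 10646 code into the 8-bit character-set
 * selector and the 24-bit character number described above. */
static void split_dis10646(uint32_t code,
                           unsigned *charset, uint32_t *character)
{
    *charset   = (unsigned) (code >> 24);   /* top 8 bits      */
    *character = code & 0x00FFFFFFu;        /* bottom 24 bits  */
}

int main(void)
{
    uint16_t u = 0x00C9;   /* a 16-bit Unicode value above 127 */
    printf("Unicode %04X -> ASCII %d\n",
           (unsigned) u, unicode_to_ascii(u));

    unsigned set;
    uint32_t ch;
    split_dis10646(0x05001234u, &set, &ch);  /* illustrative value only */
    printf("DIS 10646: character set %u, character %lu\n",
           set, (unsigned long) ch);
    return 0;
}
\end{verbatim}

A real translator would, of course, need a full mapping table for the thousands of characters above 127, together with some policy for characters that have no equivalent in the target coding.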
Time will tell, but if past experience is anything to go by, then it is the ad hoc standard, available on the hardware, which will be adopted. Nonetheless, we will still require translation programs (and people to write them) for some time yet!

\author{David Penfold}