Unicode

ASCII is a 7-bit encoding of 128 characters: the basic Latin alphabet, the Arabic numerals, the dollar sign, some punctuation and symbols, a blank space, and a set of control codes that originated with Teletype machines. It’s a pretty limited character set.

Later, as 8-bit variants of ASCII became possible, the character set was extended in the ISO-8859-1 standard to include letters with diacritics and other characters required for Western European languages, plus some mathematical and currency symbols. The ISO-8859-1 characters also occupy the first 256 code points of Unicode. Windows-1252 was an extension of ISO-8859-1 designed by Microsoft, which added some characters needed for traditional printing (em and en dashes, ‘curly’ quotes, daggers, single guillemets), as well as the euro sign and some additional letters with diacritics. Meanwhile, Apple computers prior to OS X used the Mac-Roman encoding, which contained most of the same characters as ISO-8859-1 and Windows-1252, but in a different arrangement. So, even within the Latin script, there were several conflicting ways of encoding the same characters. Hence the need for a single large encoding system with a unique code point for every character.
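The conflict between these 8-bit encodings can be seen with a short Python sketch. It relies only on the codecs that ship with Python (`latin-1`, `cp1252`, `mac_roman`); the variable names are just for illustration.

```python
# The same byte means different things in different 8-bit encodings.
byte = b"\x80"
as_cp1252 = byte.decode("cp1252")        # euro sign in Windows-1252
as_mac_roman = byte.decode("mac_roman")  # A with diaeresis in Mac-Roman

# The same character gets different bytes in different encodings.
e_latin1 = "é".encode("latin-1")      # b'\xe9'
e_cp1252 = "é".encode("cp1252")       # b'\xe9' (same position as ISO-8859-1)
e_mac_roman = "é".encode("mac_roman") # b'\x8e' (different arrangement)

# Decoding all 256 byte values as ISO-8859-1 yields Unicode code points 0..255.
latin1_matches_unicode = (
    bytes(range(256)).decode("latin-1") == "".join(chr(i) for i in range(256))
)

print(as_cp1252, as_mac_roman, e_latin1, e_mac_roman, latin1_matches_unicode)
```

A file saved in one of these encodings and opened in another shows the wrong characters for everything above byte 127, which is the problem a single shared set of code points removes.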

Unicode is the universal character set, intended to encode all of the world’s writing systems as well as non-alphabetic symbols such as mathematical symbols, musical symbols and dingbats. It is used in most modern operating systems and, in the form of UTF-8, is now the dominant character encoding on the World Wide Web. Unicode version 16.0 defines over 154,000 characters.

ASCII was incorporated into the Unicode character set as the first 128 code points, so the 7-bit ASCII characters have the same numeric codes in both sets. This allows UTF-8 to be backward compatible with 7-bit ASCII, as a UTF-8 file containing only ASCII characters is byte-for-byte identical to an ASCII file containing the same sequence of characters.
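As a quick check of that backward compatibility, here is a minimal Python sketch. The sample string is arbitrary; any text made only of ASCII characters should behave the same way.

```python
text = "Plain ASCII text: \\ { } $ # % & _ ^ ~"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Identical byte sequences: an ASCII-only UTF-8 file is also a valid ASCII file.
same_bytes = ascii_bytes == utf8_bytes

# Each byte is below 128 and equals the character's Unicode code point.
codes_match = all(ord(ch) == b and b < 0x80 for ch, b in zip(text, utf8_bytes))

# A character outside ASCII needs more than one byte in UTF-8.
e_acute_bytes = "é".encode("utf-8")

print(same_bytes, codes_match, e_acute_bytes)
```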

In XeTeX, input files are assumed to be in UTF-8, so characters from any script encoded by Unicode can be entered in the input file and will appear in the output (provided that the font you’re using has the required glyphs). The TeX ‘special characters’ –
\ { } $ # % & _ ^ ~
– still have to be entered in the following way though:
$\backslash$ $\{$ $\}$ \$ \# \% \& \_ \^{} \~{}.
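If the text for a .tex file is generated by a script, the escapes listed above can be applied automatically. The following Python sketch is one way to do it; the function name `escape_tex` is hypothetical, and the replacement strings are simply the ones given above.

```python
# Map each TeX special character to the escape given in the text above.
TEX_ESCAPES = {
    "\\": "$\\backslash$",
    "{": "$\\{$",
    "}": "$\\}$",
    "$": "\\$",
    "#": "\\#",
    "%": "\\%",
    "&": "\\&",
    "_": "\\_",
    "^": "\\^{}",
    "~": "\\~{}",
}

# str.translate replaces each character in one pass, so the backslashes
# and dollar signs introduced by one escape are never escaped again.
_TABLE = str.maketrans(TEX_ESCAPES)


def escape_tex(text):
    """Return text with TeX special characters replaced by their escapes."""
    return text.translate(_TABLE)


print(escape_tex("50% of $10 & #1"))
```

All other characters, including non-ASCII ones, pass through unchanged, which matches the way XeTeX reads them directly from UTF-8 input.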

The traditional TeX markup for quote marks and dashes and for Spanish punctuation, whereby ` ', `` '', --, ---, !`, ?` are changed to ‘ ’, “ ”, –, —, ¡, ¿ in the output, can still be used if you include mapping=tex-text in your font declaration.
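To make the substitutions concrete, here is a rough Python emulation of the pairs listed above. It is only an illustration of the input-to-output correspondence, not the actual mapping file XeTeX uses, and the names `TEX_TEXT_PAIRS` and `apply_tex_text` are hypothetical.

```python
# Ordered (input, output) pairs. Longer sequences come first so that
# "---" is not consumed as "--" plus "-", and "``" not as two single quotes.
TEX_TEXT_PAIRS = [
    ("---", "\u2014"),  # em dash
    ("--", "\u2013"),   # en dash
    ("!`", "\u00a1"),   # inverted exclamation mark
    ("?`", "\u00bf"),   # inverted question mark
    ("``", "\u201c"),   # left double quotation mark
    ("''", "\u201d"),   # right double quotation mark
    ("`", "\u2018"),    # left single quotation mark
    ("'", "\u2019"),    # right single quotation mark
]


def apply_tex_text(text):
    """Apply the substitutions in order and return the resulting string."""
    for source, target in TEX_TEXT_PAIRS:
        text = text.replace(source, target)
    return text


print(apply_tex_text("``Yes,'' she said---`no'. See pp. 1--2. ?`Qu\u00e9?"))
```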

The following Plain TeX control sequences:

\`o \'o \^o \"o \~o \=o \.o \v o \u o \H o \t oo \c c \d o \b o \oe\ \OE\ \ae\ \AE\ \aa\ \AA\ \o\ \O\ \l\ \L\ \ss\ \dag\ \ddag\ \S\ \P\ \copyright

for entering accented letters and symbols are redundant in XeTeX (they don’t seem to work properly).* The characters can be entered directly in the input file:

ò ó ô ö õ ō ȯ ǒ ŏ ő ç ọ œ Œ æ Æ å Å ø Ø ł Ł ß † ‡ § ¶ ©.

The only ones that don’t have general Unicode equivalents as pre-composed glyphs are \t oo ‘tie-after accent’ (double inverted breve), \d o ‘dot below’ and \b o ‘bar below’. Unicode does, however, have combining marks for all three: U+0361 ‘combining double inverted breve’ o͡o, U+0323 ‘combining dot below’ ọ and U+0331 ‘combining macron below’ o̱. There are also some pre-composed letters with underdots, including o with dot below itself (U+1ECD).
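Which accent and letter pairs have a pre-composed code point can be checked with Python’s standard `unicodedata` module. In the sketch below, the pairing of each Plain TeX accent command with a combining mark is my own illustrative assumption, and `compose` and `has_precomposed` are hypothetical helper names.

```python
import unicodedata

# Plain TeX accent command (without the backslash) -> combining mark.
ACCENTS = {
    "`": "\u0300",  # combining grave accent
    "'": "\u0301",  # combining acute accent
    "^": "\u0302",  # combining circumflex accent
    '"': "\u0308",  # combining diaeresis
    "~": "\u0303",  # combining tilde
    "=": "\u0304",  # combining macron
    ".": "\u0307",  # combining dot above
    "v": "\u030c",  # combining caron
    "u": "\u0306",  # combining breve
    "H": "\u030b",  # combining double acute accent
    "c": "\u0327",  # combining cedilla
    "d": "\u0323",  # combining dot below
    "b": "\u0331",  # combining macron below
}


def compose(base, command):
    """Return base plus accent, NFC-normalized (pre-composed if one exists)."""
    return unicodedata.normalize("NFC", base + ACCENTS[command])


def has_precomposed(base, command):
    """True if the accented letter collapses to a single code point."""
    return len(compose(base, command)) == 1


for command in ACCENTS:
    base = "c" if command == "c" else "o"
    result = compose(base, command)
    print(command, base, result, [f"U+{ord(ch):04X}" for ch in result])

# The tie accent spans two letters and always stays a combining sequence.
tie = unicodedata.normalize("NFC", "o\u0361o")
print(len(tie), unicodedata.name("\u0361"))
```

Under these assumptions, o with dot below collapses to the single code point U+1ECD, while o with macron below and the tie remain sequences of base letters and combining marks.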

* \dag \ddag \S \P work, but the output is always in Computer Modern. \copyright works, but the ‘c’ is not centered properly in the circle. \d and \t seem to work quite well.

All those characters can be entered in the .tex file. But how do you get them there?
input.tex, input.pdf

Unicode and OpenType

Although OpenType has tags for e.g. small caps, swash italics, discretionary and historical ligatures, there are no Unicode code points for most of these variants. This is because Unicode is only concerned with encoding linguistically meaningful signs, not their typographic forms and permutations. So, for Unicode, having a code point for a small cap ‘a’ would be like having separate code points for an italic ‘a’, or a bold ‘a’ etc. Similarly with numerals: although OpenType fonts can have lining tabular, lining proportional, oldstyle tabular and oldstyle proportional numerals, Unicode has code points for only one set of digits. These code points display the font’s default set of numerals, usually lining tabular.

Unicode isn’t entirely consistent though. It does have code points for f ligatures and some others: (ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ). It also has code points for superscripts and subscripts, in the ‘Latin-1 Supplement’ and ‘Superscripts and Subscripts’ Unicode blocks; numerals in circles and brackets or followed by dots (①②③⑴⑵⑶⒈⒉⒊) in ‘Enclosed Alphanumerics’; and more series of numerals for mathematical use in ‘Mathematical Alphanumeric Symbols’. There are also pre-composed fractions in ‘Latin-1 Supplement’ (¼ ½ ¾) and ‘Number Forms’ (⅓ ⅔ and others).
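Unicode itself marks most of these characters as compatibility variants of ordinary letters and digits. A minimal Python sketch using the standard `unicodedata` module shows this; the helper name `fold` is hypothetical, and NFKC normalization is used here only to reveal the underlying plain characters.

```python
import unicodedata


def fold(text):
    """Replace compatibility characters with their plain equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\ufb01",      # fi ligature
    "\ufb03",      # ffi ligature
    "\u00b2",      # superscript two
    "\u2460",      # circled digit one
    "\u2474",      # parenthesized digit one
    "\u2488",      # digit one full stop
    "\u00bd",      # vulgar fraction one half
    "\U0001d7cf",  # mathematical bold digit one
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "->",
        fold(ch),
        "| decomposition:",
        unicodedata.decomposition(ch),
    )

# Canonical normalization (NFC) leaves the ligature alone; only the
# compatibility forms (NFKC, NFKD) split it into separate letters.
nfc_keeps_ligature = unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
print(nfc_keeps_ligature)
```

The decomposition tags printed for each sample (such as compat, super, circle and fraction) are Unicode’s own record that these code points are presentation forms of other characters.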

If you copy and paste text from the PDF of the following file into a text editor or word processor, you will see that a lot of the glyphs accessed through OpenType tags do not copy properly (it varies depending on the font).

compatibility.tex, compatibility.pdf

That’s not really a problem if the PDF is to be printed or viewed on the web. But it’s a bit disconcerting if you copy and paste the text from the PDF into a word processor and all the f ligatures disappear.
