Character Properties, Case Mappings & Names
Q: Do all scripts have upper and lower case?
A: No, as a matter of fact, most scripts do not have cases.
[JR]
Q: Do the case mappings in Unicode allow a
round-trip?
A: No, there are instances where two characters map to the
same result. For example, both a sigma and a final sigma uppercase to a
capital sigma. There are other cases where the uppercase of a character
requires decomposition. In some cases, the correct mapping also depends
on the locale. For example, in Turkish, a lowercase i maps to an
uppercase dotted I (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE). [MD]
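In Python, whose str.upper() and str.lower() implement the locale-independent Unicode case mappings, the sigma example is easy to observe; the Turkish mapping, being locale-dependent, is deliberately not applied:

```python
# Both forms of small sigma uppercase to the same capital sigma,
# so the mapping cannot be reversed unambiguously.
sigma, final_sigma, capital_sigma = "\u03c3", "\u03c2", "\u03a3"
assert sigma.upper() == capital_sigma
assert final_sigma.upper() == capital_sigma

# Locale-independent lowercasing of "I" gives ASCII "i", never the
# Turkish dotless i; Turkish casing requires locale-aware tailoring.
assert "I".lower() == "i"
```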
Q: Doesn't this cause a problem?
A: Remember that in general, case mappings of strings lose
information and thus do not allow round tripping. Take the word "anglo-American"
or the Italian word "vederLa". Once you uppercase, lowercase or
titlecase these strings, you can't recover the original just by
performing the reverse operation. [MD]
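A short Python sketch of the information loss described above:

```python
# Case operations are lossy: once mixed-case distinctions are
# flattened, the original string cannot be recovered.
original = "anglo-American"
flattened = original.lower()        # "anglo-american"
restored = flattened.capitalize()   # "Anglo-american" -- only a guess
assert restored != original         # the round trip fails
```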
Q: Why aren't there extra characters to
support locale-independent casing for Turkish?
A: The fact is that there is too much data coded in 8859-9
(with 0xDD = LATIN CAPITAL LETTER I WITH DOT and 0xFD = LATIN SMALL
LETTER DOTLESS I) which contains both Turkish and non-Turkish text.
Transcoding this data to Unicode would be intolerably difficult if it
all had to be tagged first to sort out which 0x49 characters are
ordinary "I" and which are CAPITAL LETTER DOTLESS I. Better to accept
the compromise and get on with moving to Unicode. Moreover, there is a
strong doubt that users will "get it right" in future either when they
enter new characters. [JC]
Q: Why is there no upper-case SHARP S (ß)?
A: There are 139 lower-case letters in Unicode 2.1 that
have no direct uppercase equivalent. Should new bogus characters be
introduced for all of them, so that when you see an "fl" ligature you
can uppercase it to "FL" without expanding anything? Of course not.
Note that case conversion is inherently language-sensitive,
notably in the case of IPA, which needs to be left strictly alone even
when embedded in another language which is being case converted. The
best you can get is an approximate fit. [JC]
Q: Is all of the Unicode case mapping
information in
UnicodeData.txt?
A: No. The
UnicodeData.txt file includes all of the 1:1 case mappings, but
doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file (SpecialCasing.txt)
was added to provide the 1:many mappings. For more information,
see UTR #21: Case Mappings. [MD]
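Python's str.upper() applies the full case mappings, including the 1:many expansions that come from SpecialCasing.txt rather than UnicodeData.txt, so the effect is easy to observe:

```python
# 1:many mappings are absent from UnicodeData.txt's single-character
# case fields; they are supplied by SpecialCasing.txt.
assert "\u00df".upper() == "SS"          # ß uppercases to two characters
assert "\ufb02".upper() == "FL"          # the fl ligature expands as well
assert len("stra\u00dfe".upper()) == 7   # "straße" (6 chars) -> "STRASSE" (7)
```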
Q: Near the end of the SpecialCasing.txt,
there are the two lines on SIGMA that look weird to me. Can you explain
them:
# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA
A: Both of these are conditional (column 5); that is, in
normal Greek text a 03C3 (non-final sigma) should be written as 03C2
(final sigma) if it is at the end of a word, and a 03C2 (final sigma)
should be written as a 03C3 (non-final sigma) if it is not at the end of
a word. That's what these two lines would mean if they were uncommented.
However, they are commented out for precisely that reason: the SpecialCasing
file is not intended to normalize the appearance of a small sigma.
[MD]
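The conditional mapping that SpecialCasing.txt does carry uncommented is the one for lowercasing capital sigma: U+03A3 lowercases to final sigma at the end of a word. CPython (3.3 and later) implements this Final_Sigma condition in str.lower():

```python
# Lowercasing capital sigma is context-sensitive: word-final position
# yields U+03C2 (final sigma), any other position yields U+03C3.
assert "\u039f\u0394\u039f\u03a3".lower() == "\u03bf\u03b4\u03bf\u03c2"  # ΟΔΟΣ
assert "\u03a3\u039f".lower() == "\u03c3\u03bf"  # ΣΟ: sigma not word-final
```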
Q: Where are private use characters used,
and how should they be handled?
A: Private use characters (also known as user defined
characters) are used commonly in East Asia, particularly Japan, China,
and Korea, to extend the available characters in various national
standard and vendor character sets. The Unicode Standard also makes
provision for private use characters. Since the Unicode Standard
includes so many more standard characters than any other character
encoding, there is less of a requirement for private use characters than
in a typical legacy character set; however, there are occasionally cases
where characters that are not yet in the standard need to be represented
by codepoints in the Private Use Area (PUA). Some private use characters
may never get standard encodings for one reason or another. Also, a
particular implementation may choose to use private use characters for
specific internal purposes.
It is relatively easy for Input Method Editors (IME) to
allow private use characters to be added in the PUA, keeping track of
the text sequence that should convert to those private use characters.
With modern font technologies such as OpenType and AAT, these characters
can also be added to fonts for display. However, the same codepoints in
the PUA may be given different meanings in different contexts, since
they are, after all, defined by users and are not standardized. If text
comes, for example, from a legacy NEC encoding in Japan, the same
codepoint in the PUA may mean something entirely different if
interpreted on a legacy Fujitsu machine, even though both systems would
share the same standard codepoints. For each given interpretation of a
private use character one would have to pick the appropriate IME user
dictionary and fonts to work with it.
One should not expect the rest of an operating system to
override the character properties for these private use characters,
since private use characters can have different meanings, depending on
how they originated. In terms of line breaking, case conversions, and
other textual processes, private use characters will typically be
treated by the operating system as otherwise undistinguished letters (or
ideographs) with no uppercase/lowercase distinctions.
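A brief Python illustration of this default treatment: PUA codepoints carry the general category Co, no names, and no case mappings; any richer interpretation must come from the application:

```python
import unicodedata

ch = "\ue000"  # first codepoint of the BMP Private Use Area
assert unicodedata.category(ch) == "Co"       # "Other, private use"
assert unicodedata.name(ch, None) is None     # PUA codepoints have no names
assert ch.upper() == ch and ch.lower() == ch  # and no case mappings
```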
[MD] and
[KW]
Q: The character name for the control
character U+0082 is BREAK PERMITTED HERE. Does that mean I have to
interpret that control character in that way?
A: The character names are actually undefined, and
simply marked by "&lt;control&gt;" to indicate their functional use. What you
are thinking of as names are marked as aliases pointing to the ISO 6429
usage, as in
http://www.unicode.org/charts/PDF/U0080.pdf.
The Unicode Standard does not define U+0082 to mean "BREAK
PERMITTED HERE". It just says that it is a control code, one which in
ISO 6429 has that name and meaning. Implementers of the Unicode Standard
are not required to interpret U+0082 in accordance with ISO 6429 (or
to interpret it at all).
The standard does assign particular properties and
semantics for the high-use controls, including tab, carriage return,
line feed, form feed, and next line. But it does not give the majority
of control codes any semantics at all; that is left to a higher-level
protocol. [MD]
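This is visible in Python's unicodedata module, which reflects the UCD: U+0082 has the general category Cc but no Name property value:

```python
import unicodedata

assert unicodedata.category("\x82") == "Cc"    # a control code
assert unicodedata.name("\x82", None) is None  # but no character name
# Even a high-use control like tab has no name; the standard assigns
# it semantics (horizontal tab) but leaves the Name property empty.
assert unicodedata.category("\t") == "Cc"
assert unicodedata.name("\t", None) is None
```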
Q: Where can I find formal definitions of
the terms used in the Character Name field of the UnicodeData.txt file?
Most specifically, precise explanations of designations like "turned",
"inverse", "inverted", "reversed", "rotated"
A: These terms are basically typographical rather than
Unicode-specific.
A turned character is one that has been rotated 180 degrees
around its center. A turned "e" winds up with the opening in the upper
left portion. U+0259 LATIN SMALL LETTER SCHWA is a turned "e".
An inverted character has been flipped along the horizontal axis. An
inverted "e" winds up with the opening in the upper right portion. There
is no Unicode character representing an inverted "e". A reversed
character has been flipped along the vertical axis.
A reversed "e" winds up with the opening in the lower left portion.
U+0258 LATIN SMALL LETTER REVERSED E is a reversed "e".
A rotated character has been rotated 90 degrees, but one can't tell
which way without looking at the glyph. U+213A ROTATED CAPITAL Q is a
"Q" that has been rotated counterclockwise.
"Inverse" means that the white parts of the glyph are made black, and
vice versa. An inverse "e" looks like a normal "e" but is white on a
black background. There is no Unicode character representing an inverse
"e". [JC]
Q: Are any unassigned characters or reserved characters
given default properties?
A: The Bidi Algorithm (UAX #9)
gives different default Bidi_Class property values to certain ranges of unassigned codepoints: see the discussion
of the Bidi Class in UCD.html
for details. This differs from the general policy of giving a single default
value to all unassigned codepoints. Also look at the UCD file
DerivedBidiClass.txt which assigns Bidi_Class values to the
unassigned codepoints (anything not mentioned in that file belongs to
class L).
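As a sketch, a range-based lookup of these defaults might look like the following; the two ranges shown are merely examples taken from UAX #9, and DerivedBidiClass.txt remains the authoritative source:

```python
# Illustrative subset only -- the full set of default ranges lives in
# DerivedBidiClass.txt; anything not covered there defaults to class L.
DEFAULT_BIDI_RANGES = [
    (0x0590, 0x05FF, "R"),   # e.g. the Hebrew block defaults to R
    (0x0600, 0x07BF, "AL"),  # e.g. Arabic-script blocks default to AL
]

def default_bidi_class(cp: int) -> str:
    """Default Bidi_Class for an unassigned codepoint (partial sketch)."""
    for lo, hi, cls in DEFAULT_BIDI_RANGES:
        if lo <= cp <= hi:
            return cls
    return "L"
```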
Note: for each Unicode property, UCD.html also summarizes
where to find the data for the property values, and the default value used for unassigned characters.
Q: Unicode now treats the SOFT HYPHEN as a format control (Cf) character
when formerly it was a punctuation character (Pd). Doesn't this break ISO
8859-1 compatibility?
A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic
character that is imaged by a graphic symbol identical with, or similar to,
that representing hyphen" (section 6.3.3), but does not specify details of
how or when it is to be displayed, nor other details of its semantics. The
soft hyphen has had a long history of legacy implementation in two or more
incompatible ways.
Unicode clarifies the semantics of this character for Unicode
implementations, but this does not affect its usage in ISO 8859-1
implementations. Processes that convert back and forth may need to pay
attention to semantic differences between the standards, just as for any
other character.
In a terminal emulation environment, particularly in ISO-8859-1 contexts,
one could display the soft hyphen as a hyphen in all circumstances. The
change in semantics of the Unicode character does not require that
implementations of terminal emulators in other environments, such as ISO
8859-1, make any change in their current behavior.
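The property change is easy to confirm in Python, whose unicodedata module tracks the UCD:

```python
import unicodedata

assert unicodedata.name("\u00ad") == "SOFT HYPHEN"
assert unicodedata.category("\u00ad") == "Cf"  # format control, no longer Pd
```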
Q: Where can I find the numerical values of characters with the Hexadecimal
Digit (Hex_Digit) property?
A: The Unicode Standard provides the Hex_Digit property,
which
specifies which characters are hexadecimal digits: 0-9, A-F, a-f, and
their fullwidth equivalents. (The ASCII_Hex_Digit property specifies
the intersection of the Hex_Digit property and the Basic Latin block.)
There is no table in the UCD mapping the hexadecimal digit characters to
their values, analogous to the Numeric_Value property.
The
table linked here removes this real, if trivial, gap. [JC]
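Since the UCD supplies no value table for Hex_Digit characters, the conversion has to be written by hand. A minimal Python sketch (the helper name is ours, not part of any standard API) covering 0-9, A-F, a-f and their fullwidth equivalents:

```python
import unicodedata

def hex_digit_value(ch: str) -> int:
    """Numeric value of a Hex_Digit character (hypothetical helper)."""
    try:
        # Covers 0-9 and the fullwidth digits, via their Decimal value.
        return unicodedata.decimal(ch)
    except ValueError:
        pass
    # NFKC folds fullwidth A-F/a-f to their ASCII forms.
    folded = unicodedata.normalize("NFKC", ch).lower()
    if folded in "abcdef":
        return 10 + ord(folded) - ord("a")
    raise ValueError(f"{ch!r} is not a hexadecimal digit")
```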
Q: Why is the hacek accent called "caron"
in Unicode?
A: Nobody knows.
Legend has it that the term was first spotted in one of the 'giant
books' from the '30s at Mergenthaler Linotype Company in Brooklyn, NY;
but no one has been able to confirm that.
More accurate reports trace the term back to the mid-1980s, when we do
have documented sightings of "caron" in publications such as:
- The TypEncyclopedia by Frank Romano (Libraries Unlimited, 1984;
ISBN 0-835-21925-9): p. 6 shows the mark with the notation
"caron/hacek/clicka".
- IBM's Green Book, the National Language Support Reference Manual
(4th ed., 1994; National Language Design Guide, 2), which has an
original copyright date of 1986: "Caron Accent" appears on p. K-432,
in a table entitled "Diacritic Mark Special Graphic Characters."
- SGML & Adobe documentation in this 1986 reference.
Unicode and ISO 8859-x just carried the tradition along.
In an article published in 2001: "Orthographic
diacritics and multilingual computing",
J.C. Wells - a linguist at University College London - writes:
"The term ‘caron’, however, is wrapped in mystery. Incredibly, it seems
to appear in no current dictionary of English, not even the OED."
Whoever the originator is, we suspect that he has probably taken his
secret to the grave by now. [Various authors]