
Character Properties, Case Mappings & Names

Q: Do all scripts have upper and lower case?

A: No. As a matter of fact, most scripts do not have case distinctions. [JR]

Q: Do the case mappings in Unicode allow a round-trip?

A: No, there are instances where two characters map to the same result. For example, both a sigma and a final sigma uppercase to a capital sigma. There are other cases where the uppercase of a character expands to a sequence of more than one character (for example, ß uppercases to SS). In some cases, the correct mapping also depends on the locale. For example, in Turkish, a lowercase i maps to a dotted capital İ (U+0130). [MD]

Q: Doesn't this cause a problem?

A: Remember that in general, case mappings of strings lose information and thus do not allow round tripping. Take the word "anglo-American" or the Italian word "vederLa". Once you uppercase, lowercase or titlecase these strings, you can't recover the original just by performing the reverse operation. [MD]
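As a rough illustration, here is a minimal sketch in Python 3, whose str.upper() and str.lower() implement the default (locale-independent) Unicode case mappings; none of these operations can be reversed reliably:

    # Case mappings lose information, so they cannot be undone reliably.
    word = "anglo-American"
    print(word.upper())              # ANGLO-AMERICAN
    print(word.upper().lower())      # anglo-american -- the original capitalization is gone

    # Both sigma forms uppercase to the same capital sigma ...
    print("σ".upper(), "ς".upper())  # Σ Σ
    # ... so lowercasing a lone capital sigma cannot tell you which form you started from.
    print("Σ".lower())               # σ

    # Some uppercase mappings expand the string (a 1:many mapping).
    print("ß".upper())               # SS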

Q: Why aren't there extra characters to support locale-independent casing for Turkish?

A: The fact is that there is too much data encoded in 8859-9 (with 0xDD = LATIN CAPITAL LETTER I WITH DOT ABOVE and 0xFD = LATIN SMALL LETTER DOTLESS I) which contains both Turkish and non-Turkish text. Transcoding this data to Unicode would be intolerably difficult if it all had to be tagged first to sort out which 0x49 characters are ordinary "I" and which would be the hypothetical CAPITAL LETTER DOTLESS I. Better to accept the compromise and get on with moving to Unicode. Moreover, there is serious doubt that users would "get it right" in the future when entering new text, either. [JC]
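For applications that do know a particular piece of text is Turkish, the locale-specific mapping can be applied explicitly. The following is only an illustrative sketch (the function name is made up, not a standard API); production code would normally use a library such as ICU:

    # Illustrative only: apply the Turkish-specific uppercase mapping by hand,
    # then fall back to the default Unicode mappings for everything else.
    def upper_turkish(text: str) -> str:
        # In Turkish, dotted i uppercases to İ (U+0130); dotless ı (U+0131)
        # already uppercases to the ordinary I under the default mapping.
        return text.replace("i", "\u0130").upper()

    print(upper_turkish("istanbul"))   # İSTANBUL
    print("istanbul".upper())          # ISTANBUL (default, non-Turkish mapping)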

Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct uppercase equivalent. Should new bogus characters be introduced for all of them, so that when you see an "fl" ligature you can uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in the case of IPA, which needs to be left strictly alone even when embedded in another language which is being case converted. The best you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings, but doesn't include 1:many mappings such as the one needed for uppercasing ß. Since many parsers now expect this file to have at most single characters in the case mapping fields, an additional file (SpecialCasing.txt) was added to provide the 1:many mappings. For more information, see UTR #21, Case Mappings. [MD]
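As an illustration, here is a minimal sketch (Python, assuming a local copy of SpecialCasing.txt from the UCD) that collects the unconditional 1:many uppercase mappings:

    # Sketch: gather the unconditional 1:many uppercase mappings from SpecialCasing.txt.
    # Data lines have the form: code; lower; title; upper; (condition;)? # comment
    def special_uppercase_mappings(path="SpecialCasing.txt"):
        mappings = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()       # drop comments
                if not line:
                    continue
                fields = [field.strip() for field in line.split(";")]
                code, lower, title, upper = fields[:4]
                if len(fields) > 4 and fields[4]:           # skip conditional mappings
                    continue
                upper_chars = "".join(chr(int(cp, 16)) for cp in upper.split())
                if len(upper_chars) > 1:                    # keep only the 1:many cases
                    mappings[chr(int(code, 16))] = upper_chars
        return mappings

    # e.g. special_uppercase_mappings()["ß"] == "SS"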

Q: Near the end of SpecialCasing.txt, there are two lines on SIGMA that look weird to me. Can you explain them:
# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA
 

A: Both of these are conditional (column 5); that is, in normal Greek text a 03C3 (non-final sigma) should be written as 03C2 (final sigma) if it is at the end of a word, and a 03C2 (final sigma) should be written as a 03C3 (non-final sigma) if it is not at the end of a word. That is what these two lines would mean if they were uncommented. However, they are commented out for precisely that reason: the SpecialCasing file is not intended to normalize the appearance of a small sigma. [MD]
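As an illustration of the condition itself, here is a simplified sketch (Python; the function name is made up, and the real Final_Sigma rule in the standard is defined in terms of cased and case-ignorable characters rather than plain letters):

    import unicodedata

    def lower_greek(text: str) -> str:
        # Simplified Final_Sigma rule: a capital sigma lowercases to final sigma (ς)
        # when it follows a letter and is not followed by one; otherwise to σ.
        result = []
        for i, ch in enumerate(text):
            if ch == "\u03A3":  # GREEK CAPITAL LETTER SIGMA
                before = i > 0 and unicodedata.category(text[i - 1]).startswith("L")
                after = i + 1 < len(text) and unicodedata.category(text[i + 1]).startswith("L")
                result.append("\u03C2" if before and not after else "\u03C3")
            else:
                result.append(ch.lower())
        return "".join(result)

    print(lower_greek("ΟΔΥΣΣΕΥΣ"))   # οδυσσευς -- final sigma only at the end of the word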

Q: Where are private use characters used, and how should they be handled?

A: Private use characters (also known as user defined characters) are used commonly in East Asia, particularly Japan, China, and Korea, to extend the available characters in various national standard and vendor character sets. The Unicode Standard also makes provision for private use characters. Since the Unicode Standard includes so many more standard characters than any other character encoding, there is less of a requirement for private use characters than in a typical legacy character set; however, there are occasionally cases where characters that are not yet in the standard need to be represented by codepoints in the Private Use Area (PUA). Some private use characters may never get standard encodings for one reason or another. Also, a particular implementation may choose to use private use characters for specific internal purposes.

It is relatively easy for Input Method Editors (IME) to allow private use characters to be added in the PUA, keeping track of the text sequence that should convert to those private use characters. With modern font technologies such as OpenType and AAT, these characters can also be added to fonts for display. However, the same codepoints in the PUA may be given different meanings in different contexts, since they are, after all, defined by users and are not standardized. If text comes, for example, from a legacy NEC encoding in Japan, the same codepoint in the PUA may mean something entirely different if interpreted on a legacy Fujitsu machine, even though both systems would share the same standard codepoints. For each given interpretation of a private use character one would have to pick the appropriate IME user dictionary and fonts to work with it.

One should not expect the rest of an operating system to override the character properties for these private use characters, since private use characters can have different meanings, depending on how they originated. In terms of line breaking, case conversions, and other textual processes, private use characters will typically be treated by the operating system as otherwise undistinguished letters (or ideographs) with no uppercase/lowercase distinctions. [MD] and [KW]
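When an implementation does need to special-case private use characters, detecting them is straightforward. A minimal sketch (Python), using the three standard Private Use Area ranges:

    import unicodedata

    def is_private_use(ch: str) -> bool:
        # The PUA ranges: U+E000..U+F8FF in the BMP, plus planes 15 and 16.
        cp = ord(ch)
        return (0xE000 <= cp <= 0xF8FF or
                0xF0000 <= cp <= 0xFFFFD or
                0x100000 <= cp <= 0x10FFFD)

    print(is_private_use("\uE000"))          # True
    print(unicodedata.category("\uE000"))    # Co -- General_Category "Private Use"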

Q: The character name for the control character U+0082 is BREAK PERMITTED HERE. Does that mean I have to interpret that control character in that way?

A: The character names for the control codes are actually undefined; in the code charts they are simply marked "<control>" to indicate their functional use. What you are thinking of as names are listed as aliases pointing to the ISO 6429 usage, as in http://www.unicode.org/charts/PDF/U0080.pdf.

The Unicode Standard does not define U+0082 to mean "BREAK PERMITTED HERE". It just says that it is a control code, one which in ISO 6429 has that name and meaning. Implementers of the Unicode Standard are not required to interpret U+0082 in accordance with ISO 6429 (or to interpret it at all).

The standard does assign particular properties and semantics for the high-use controls, including tab, carriage return, line feed, form feed, and next line. But it does not give the majority of control codes any semantics at all; that is left to a higher-level protocol. [MD]
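This is visible in the character properties themselves; for example, using Python's unicodedata module:

    import unicodedata

    ch = "\u0082"
    print(unicodedata.category(ch))           # Cc -- a control code
    print(unicodedata.name(ch, "<control>"))  # prints the fallback "<control>", because
                                              # control codes have no character name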

Q: Where can I find formal definitions of the terms used in the Character Name field of the UnicodeData.txt file? More specifically, where can I find precise explanations of designations like "turned", "inverse", "inverted", "reversed", and "rotated"?

A: These terms are basically typographical rather than Unicode-specific.

A turned character is one that has been rotated 180 degrees around its center. A turned "e" winds up with the opening in the upper left portion. U+0259 LATIN SMALL LETTER SCHWA is a turned "e".

An inverted character has been flipped along the horizontal axis. An inverted "e" winds up with the opening in the upper right portion. There is no Unicode character representing an inverted "e".

A reversed character has been flipped along the vertical axis. A reversed "e" winds up with the opening in the lower left portion. U+0258 LATIN SMALL LETTER REVERSED E is a reversed "e".

A rotated character has been rotated 90 degrees, but one can't tell which way without looking at the glyph. U+213A ROTATED CAPITAL Q is a "Q" that has been rotated counterclockwise.

"Inverse" means that the white parts of the glyph are made black, and vice versa. An inverse "e" looks like a normal "e" but is white on a black background. There is no Unicode character representing an inverse "e". [JC]

Q: Are any unassigned characters or reserved characters given default properties?

A: The Bidi Algorithm (UAX #9) gives different default Bidi_Class property values to certain ranges of unassigned codepoints; see the discussion of the Bidi Class in UCD.html for details. This is different from the general policy of giving a single default value to all unassigned codepoints. Also look at the UCD file DerivedBidiClass.txt, which assigns Bidi_Class values to the unassigned codepoints (anything not mentioned in that file belongs to class L).

Note: for each Unicode property, UCD.html also summarizes where to find the data for the property values, and the default value used for unassigned characters.
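A minimal sketch (Python, assuming a local copy of DerivedBidiClass.txt) of looking up the Bidi_Class of a codepoint, including the default values it assigns to unassigned codepoints:

    def load_bidi_classes(path="DerivedBidiClass.txt"):
        # Each data line has the form "<code or range> ; <Bidi_Class> # comment".
        ranges = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                codes, value = (field.strip() for field in line.split(";"))
                start, _, end = codes.partition("..")
                ranges.append((int(start, 16), int(end or start, 16), value))
        return ranges

    def bidi_class(cp: int, ranges) -> str:
        for start, end, value in ranges:
            if start <= cp <= end:
                return value
        return "L"   # anything not mentioned in the file belongs to class L

    ranges = load_bidi_classes()
    print(bidi_class(0x05D0, ranges))   # R -- HEBREW LETTER ALEF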

Q: Unicode now treats the SOFT HYPHEN as a format control (Cf) character, whereas formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1 compatibility?

A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.

Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.

In a terminal emulation environment, particularly in ISO 8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances. The change in the Unicode semantics of this character does not require terminal emulators implemented for non-Unicode environments, such as ISO 8859-1, to make any change in their current behavior.

Q: Where can I find the numerical values of characters with the Hexadecimal Digit (Hex_Digit) property?

A: The Unicode Standard provides the Hex_Digit property, which specifies which characters are hexadecimal digits: 0-9, A-F, a-f, and their fullwidth equivalents. (The ASCII_Hex_Digit property specifies the intersection of the Hex_Digit property and the Basic Latin block.) There is no table in the UCD mapping the hexadecimal digit characters to their values, analogous to the Numeric_Value property. The table linked here removes this real, if trivial, gap. [JC]
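Since the values themselves are trivial to compute, a small helper suffices. A minimal sketch (Python; the function name is made up), using NFKC normalization to fold the fullwidth forms to their ASCII counterparts:

    import unicodedata

    def hex_digit_value(ch: str) -> int:
        # Fullwidth digits and letters (e.g. U+FF21 FULLWIDTH LATIN CAPITAL LETTER A)
        # normalize to their ASCII counterparts under NFKC.
        return int(unicodedata.normalize("NFKC", ch), 16)

    print(hex_digit_value("F"))        # 15
    print(hex_digit_value("\uFF21"))   # 10 -- FULLWIDTH LATIN CAPITAL LETTER A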

Q: Why is the hacek accent called "caron" in Unicode?

A: Nobody knows.

Legend has it that the term was first spotted in one of the 'giant books' from the '30s at Mergenthaler Linotype Company in Brooklyn, NY; but no one has been able to confirm that.

More accurate reports trace the term back to the mid-1980s, when we do have documented sightings of "caron" in publications such as:

  • The TypEncyclopedia by Frank Romano, ISBN: 0-835-21925-9, Libraries Unlimited; 1984
    p. 6 shows the mark with the notation "caron/hacek/clicka"

  • IBM's Green Book, which has an original copyright date of 1986.
    "Caron Accent" appears on p. K-432, in a table entitled "Diacritic Mark Special Graphic Characters."
    National Language Support Reference Manual. 4th ed. 1994. (National Language Design Guide, 2)

  • SGML & Adobe documentation in this 1986 reference

Unicode and ISO 8859-x just carried the tradition along.

In an article published in 2001, "Orthographic diacritics and multilingual computing", J.C. Wells, a linguist at University College London, writes:
"The term ‘caron’, however, is wrapped in mystery. Incredibly, it seems to appear in no current dictionary of English, not even the OED."

Whoever the originator is, we suspect that he has probably taken his secret to the grave by now. [Various authors]