Unicode and Keyboards on Windows

Michael S. Kaplan

Cathy Wissink

Windows Globalization, Microsoft Corporation

1. Introduction

To implementers, it seems inputting data into applications via keyboards should be one of the fundamentally simple features on Windows. However, once additional complexities like fonts and rendering engines are taken into consideration, input appears to be not quite so simple anymore. Adding many different keyboard layouts on top of over 135 locales further complicates the issue. And finally, once you include the ability to define keyboard layouts (whether by Microsoft interfaces or third party products) where all of Unicode can be supported, it becomes downright complex!

This paper will discuss the many features that keyboard layouts support (such as dead keys, shift states and ligatures), the interaction between input, fonts, and rendering engines, the issue of code pages vs. Unicode, when IMEs are preferred and when they are not, and the collation issues that enter into the equation. In the end, it will be clear that on Windows, the input of virtually any characters in Unicode is possible, even if in some cases more work is required than was originally expected.

2. The low-level details

Before diving into the details of a keyboard layout, it might be helpful to include a definition of a keyboard layout. A keyboard layout is the collection of data for each keystroke and shift state combination within a particular keyboard driver. It is not the physical keyboard that a user types on, but rather, the software that the hardware calls to output text streams to applications. Generally, anywhere this paper refers to a keyboard, keyboard layout is implied.

Starting with scan codes

Keyboard input starts at the hardware level. The keys on the physical keyboard each have a value assigned to them called a scan code, and these scan codes are sent whenever you type a key. To complicate things, keyboard hardware varies depending on the geographical market; in many of the markets, you will find slightly different relationships between physical keys and scan codes. Because of this, layout maps (like the full Windows XP list, which can be found at http://microsoft.com/globaldev/keyboards/keyboards.asp) can be somewhat inaccurate in some parts of the world, since the maps assume that: (a) physical key placement is identical and (b) keys will have the same meaning, even if the hardware is different. For two examples of scan code maps that cover the main part of the keyboard, see Figure 1 for US keyboards and Figure 2 for most European keyboards.

23rd Internationalization and Unicode Conference, Prague, Czech Republic, March 2003


Figure 1: Scan codes for US keyboard hardware

Figure 2: Scan codes for European keyboard hardware

Note the different placement of scan codes between these two types of keyboards. For example, scan code 0x2b is on the second row of the US keyboard, but is at the end of the third row of the European keyboard. Scan code 0x56 is an additional scan code on the European keyboard, which is not on the US keyboard. The shape of the enter key is also different.

(Note also that these maps here do not show other keys such as the numeric keypad or the function keys; since those types of keys do not change when the language of the keyboard changes, they are not covered by this paper.)

Scan code values in the hardware are invariant. Allowing scan codes to change would make the support of multiple languages exceptionally difficult. This brings us to the Virtual Key values....

Virtual Key (VK) values

As we progress from the hardware and move to the software level, what becomes crucial is the VK or Virtual Key value. These values fit within a byte (0x00 to 0xff) and are defined in winuser.h: the Platform SDK header file that contains procedure declarations, constant definitions and macros for the USER subsystem of Windows. You can see the virtual keys for the US English keyboard layout in Figure 3. The decision of how scan codes and virtual keys map to each other is made in the keyboard layout.



Figure 3: Virtual keys in the US English keyboard

Unfortunately for the implementer, the bulk of the most important VKs are not officially defined but are implied in the comments:

    /*
     * VK_0 - VK_9 are the same as ASCII '0' - '9' (0x30 - 0x39)
     * 0x40 : unassigned
     * VK_A - VK_Z are the same as ASCII 'A' - 'Z' (0x41 - 0x5A)
     */

The rest of the virtual keys in use are explicitly defined constants, and there is no rule that ties a virtual key to the same physical key on every keyboard; the assignment can change from one keyboard layout to another. Note that in Figure 3, the implicit keys are all light gray, while the explicit "OEM" keys are white [1]. You can obtain an array containing the state of every VK by calling the GetKeyboardState API.
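The rule stated in the header comment can be sketched in a few lines. This is only an illustration of the implied mapping for digits and letters, not code from the Platform SDK; the function name `vk_for_char` is hypothetical.

```python
def vk_for_char(ch: str) -> int:
    """Return the implied VK value for an ASCII digit or letter.

    Hypothetical helper: it only restates the winuser.h comment, where
    VK_0..VK_9 equal ASCII '0'..'9' and VK_A..VK_Z equal ASCII 'A'..'Z'.
    """
    if len(ch) == 1 and ch.isascii() and (ch.isdigit() or ch.isalpha()):
        # Letters use the uppercase ASCII value regardless of the case typed.
        return ord(ch.upper())
    raise ValueError("no implied VK for %r; OEM keys use explicit constants" % ch)


print(hex(vk_for_char("q")))  # the letter Q, in either case, maps to 0x51
```

Anything outside these two ranges (punctuation, symbols) has to go through the explicitly defined constants, which is why the sketch refuses them.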

The VK values are important for the window messages that have to deal with keystrokes before they are processed by the USER subsystem in Windows, such as WM_KEYDOWN. Although there are minor changes in position between different keyboards even when the character values are the same, they do not change much between different keyboard layouts. Here is an example of a typical change: the letter "Q" is represented by the VK_Q on both the French and US English keyboards though on the French keyboard the "Q" and "A" keys are in reversed positions relative to the US keyboard (see the VK map for the French keyboard in Figure 4 for comparison with the US layout in Figure 3).
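The French/US difference can be pictured as two small lookup tables from scan code to VK. This is a minimal sketch: the scan code values 0x10 and 0x1E are assumed here to be the two physical keys in question (they are not quoted from the paper's text), and the table and function names are hypothetical.

```python
VK_A = 0x41  # same as ASCII 'A'
VK_Q = 0x51  # same as ASCII 'Q'

# Assumption: 0x10 is the leftmost key of the top letter row and 0x1E is the
# leftmost key of the home row. Only these two keys are modeled.
US_SCAN_TO_VK = {0x10: VK_Q, 0x1E: VK_A}
FRENCH_SCAN_TO_VK = {0x10: VK_A, 0x1E: VK_Q}


def scan_to_vk(layout: dict, scan_code: int):
    """Look up the VK that a layout assigns to a scan code (None if unmapped)."""
    return layout.get(scan_code)


# The scan code sent by the hardware is the same; the layout decides the VK.
print(hex(scan_to_vk(US_SCAN_TO_VK, 0x10)), hex(scan_to_vk(FRENCH_SCAN_TO_VK, 0x10)))
```

The point of the sketch is that VK_Q exists in both layouts with the same value; only the physical key that produces it differs.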

[1] The OEM keys are keys that add punctuation and symbols. The ones that commonly change with different keyboards are OEM_1 through OEM_8, OEM_102, OEM_COMMA, OEM_PERIOD, OEM_PLUS, and OEM_MINUS. On these keyboard layout maps they are abbreviated with an O* prefix, followed by enough information to uniquely identify the key (e.g. O2 for OEM_2 and OP for OEM_PERIOD).



Figure 4: Virtual Keys in the French (France) keyboard

The position of the OEM keys often changes between different layouts as well. Most of the other VK positions are static. The changes are all quite minor when compared with the next step -- where those keystrokes are processed.

Processing keystrokes

When a Windows message loop handles a VK in the WM_KEYDOWN message, it can pass the VK to the DefWindowProc API. To handle the message, the code in the USER subsystem will process the keystroke and convert it (when appropriate) to a character, passed as a WM_CHAR message. This processing requires a great deal of information:

- the shift state
- the virtual key
- the current keyboard layout

Once all of this information is collected by the USER subsystem (that is, the keyboard layout is known for each thread and the WM_KEYDOWN message contains the VK and shift state), the code is then able to come up with the appropriate character, taking the shift state, VK and current layout into account. (Arrow keys, for example, are not expected to insert characters, so USER does not run any of this character-based work for them.) You can mimic this behavior with several different Win32 APIs (see Table 1 for a list of the APIs that can be useful for this).
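The three inputs named above (shift state, VK, current layout) can be modeled as a table lookup. The sketch below is a toy model, not the USER subsystem's algorithm: the layout dictionary holds only a few illustrative US English entries, and `translate` is a hypothetical name.

```python
VK_LEFT = 0x25  # left arrow key; produces no character

# Toy layout: (VK, shift pressed) -> character. Only a few entries are
# filled in for illustration.
US_ENGLISH = {
    (0x41, False): "a",
    (0x41, True): "A",
    (0x31, False): "1",
    (0x31, True): "!",
}


def translate(layout: dict, vk: int, shift: bool):
    """Return the character for a keystroke, or None when the key inserts nothing."""
    return layout.get((vk, shift))


print(translate(US_ENGLISH, 0x41, True))      # A
print(translate(US_ENGLISH, VK_LEFT, False))  # None
```

Swapping in a different dictionary for `layout` is the toy equivalent of switching keyboard layouts: the same VK and shift state can then yield a different character.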

Table 1: Keyboard input functions and what they do

Function          Description
keybd_event       Synthesizes a keystroke given a VK, a scan code, etc. (superseded by the SendInput API)
MapVirtualKey     Maps between scan codes, VKs, and characters for the current keyboard layout
MapVirtualKeyEx   Maps between scan codes, VKs, and characters for a specified keyboard layout (layout must be loaded)
OemKeyScan        Maps OEMASCII codes to OEM scan codes and shift states
SendInput         Synthesizes a keystroke given a VK, a scan code, etc.
ToAscii           Maps a VK and shift state to a character on the current keyboard layout's associated codepage
ToAsciiEx         Maps a VK and shift state to a character on the specified keyboard layout's associated codepage (layout must be loaded)
ToUnicode         Maps a VK and shift state to a Unicode character per the current keyboard layout
ToUnicodeEx       Maps a VK and shift state to a Unicode character per the specified keyboard layout (layout must be loaded)
VkKeyScan         Converts a character to a VK and shift state for the current keyboard layout
VkKeyScanEx       Converts a character to a VK and shift state for the specified keyboard layout (layout must be loaded)

The functions in Table 1 are interesting in that, going by their descriptions, several appear to be duplicates of each other. However, once you need these functions in an application, you will see that the small differences between them can matter a great deal for obtaining the features you need. [2]

In any case, your code has now passed a character onto an application and inserted text! You can look at a few of the many keyboards supported on Windows (Figures 5-8) to help you see the wide variety of possible characters to be inserted.

Figure 5: The Divehi Phonetic keyboard layout

[2] As an example, the definitions of MapVirtualKey and VkKeyScan seem similar, but the former does not handle shifted characters while the latter does. For more information, you can look at the Platform SDK: http://msdn.microsoft.com/library/en-us/winui/WinUI/WindowsUserInterface/UserInput/KeyboardInput.asp



Figure 6: The Georgian keyboard layout

Figure 7: The Gujarati keyboard layout

Figure 8: The Thai Kedmanee keyboard layout

3. Language features and their influence on input

There are many features that keyboard input can require. These include:

- single character keystrokes
- ligatures
- dead keys
- shift states
- AltGr shift states
- Control shift states
- Caps Lock key
- SGCap shift states
- extended shift states

Each of them is described below.

Single character keystrokes

The mainstay of many keyboard layouts, a simple 1:1 mapping of keystrokes to characters is what the bulk of most keyboard layouts will consist of. Some languages will use many other features as well, but all of them are likely to have at least a few single character keystrokes.

Ligatures

There are many times that a single keystroke needs to enter more than one character. In keyboard nomenclature, these 1:many mappings are called ligatures.

Note that this definition of ligature is not identical to the one used in typography or in language orthographies; "ligature" here is used to identify multiple UTF-16 code units that are input by a single keystroke. This could be used in a number of ways: to represent a linguistic character consisting of multiple UTF-16 code units (such as Sri and Ksa seen on the Tamil keyboard, shown in Figure 9); to represent multiple linguistic characters which often work together in the language; or to develop a keyboard layout for a script encoded as supplementary characters (such as the Deseret keyboard layout in Figure 10) [3]. (Technically, one could even create a keyboard with a keystroke that would insert "mike" or "cath" or "hiya" using a legal keyboard layout ligature -- as seen in the silly keyboard layout in Figure 11.)
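To make the surrogate case concrete, the following sketch splits a string into its UTF-16 values and checks it against the four-value limit per key that the paper mentions. The helper names are hypothetical, and U+10400 (a Deseret letter) is chosen here purely as an example.

```python
MAX_UNITS_PER_KEY = 4  # the per-key limit stated in this paper


def utf16_units(text: str) -> list:
    """Return the 16-bit UTF-16 values that make up `text`."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fits_on_one_key(text: str) -> bool:
    """True if `text` could be a single keyboard ligature under the limit."""
    return len(utf16_units(text)) <= MAX_UNITS_PER_KEY


# A supplementary character becomes a high surrogate plus a low surrogate.
print([hex(u) for u in utf16_units("\U00010400")])  # ['0xd801', '0xdc00']
print(fits_on_one_key("mike"))                       # True
```

A key can therefore carry at most two supplementary characters, since each one takes two of the four available slots.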

Figure 9: The Tamil keyboard in the shifted state, showing linguistic characters Sri and Ksa as ligatures

[3] Since keyboards on Windows work in UTF-16 code units, the only way to handle supplementary characters on keyboards is via ligatures (the high surrogate and the low surrogate make a ligature). The process is seamless from the user perspective; the user will not experience any difference between supplementary characters and characters on the BMP, aside from a limitation of 4 UTF-16 code units on a single key.



Figure 10: A keyboard layout for Deseret, a script encoded as supplementary characters (each represented by a "ligature" of a UTF-16 high surrogate and low surrogate)

Figure 11: A very silly (but real!) keyboard layout (created by a developer for personal use). This shows the limit of 4 UTF-16 code units for a single keystroke.

Dead keys

The dead key mechanism is either very intuitive or incredibly confusing, depending on your experience with legacy European keyboards. The basic concept is that you type a character defined on the particular keyboard as a dead key, then type a specific second character known as a base character. Rather than displaying these two characters, a single third character, the combined character, is shown. The first key is called a "dead" key because its character is not shown when it is typed, and the cursor does not advance.

Dead keys are most commonly used in European keyboard layouts; a diacritic is generally used as the dead key. An example of this can be found on the Finnish keyboard, where typing a diaeresis (U+00A8) will initially do nothing, but then typing any of the characters in the first column of Table 2 will cause the character in the second column of Table 2 to be displayed. For example, if a user types a diaeresis, followed by a small letter a, Latin small letter a with diaeresis (ä, U+00E4) will be displayed.



Table 2: The Diaeresis dead key on the Finnish keyboard

Base Character         Combined Character
a                      ä
A                      Ä
e                      ë
E                      Ë
i                      ï
I                      Ï
o                      ö
O                      Ö
u                      ü
U                      Ü
y                      ÿ
U+0020 (space)         U+00A8 (¨)
Any other character    U+00A8 + other character

The last two rows of the above table (shaded gray in the original) are important to note. The first of them is a common convention on most keyboards with dead keys; if you type the dead key and then a space, you will get the spacing version of the character. The second is not a part of the keyboard layout definition, but is simply what happens if you type a dead key followed by a character that is not defined in the keyboard layout as a base character for that dead key: the dead key character is input, followed by that second character. For example, Latin small letter c is not defined in the keyboard layout as a base character for the diaeresis dead key. If U+00A8 is typed, followed by c, those two code points will be input. No combined character will be created.
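The behavior in Table 2, including its last two rows, can be sketched as a dictionary lookup with a fallback. A real keyboard layout lists each dead key pair explicitly; building the table through Unicode normalization, as done here, is a shortcut taken only for the example, and all names are hypothetical.

```python
import unicodedata

DIAERESIS = "\u00A8"            # spacing diaeresis, the dead key character
COMBINING_DIAERESIS = "\u0308"  # used only to derive the precomposed letters
BASES = "aAeEiIoOuUy"           # the base characters listed in Table 2

# Shortcut for illustration: compose each base with U+0308 to obtain the
# single precomposed character shown in the second column of Table 2.
DEAD_KEY_TABLE = {
    base: unicodedata.normalize("NFC", base + COMBINING_DIAERESIS) for base in BASES
}
DEAD_KEY_TABLE[" "] = DIAERESIS  # dead key + space gives the spacing diaeresis


def resolve_dead_key(second: str) -> str:
    """Return what is input when `second` is typed after the diaeresis dead key."""
    # Undefined pairs fall back to the dead key character plus the second one.
    return DEAD_KEY_TABLE.get(second, DIAERESIS + second)


print(resolve_dead_key("a"))       # ä
print(len(resolve_dead_key("c")))  # 2
```

The fallback in `resolve_dead_key` corresponds to the "any other character" row: nothing is combined, and both code points are input as typed.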

While dead keys are not limited to European keyboard layouts, that is where they are most commonly used.

Shift states

A keyboard layout typically has only 47 or 48 assigned physical keys on it; even the English alphabet would not fit if you wanted both uppercase and lowercase A to Z (let alone punctuation characters). Therefore, keyboards usually contain another set of 47 or 48 characters that can be accessed by pressing Shift in tandem with a key (for examples, see Figures 12 and 13 for the Greek keyboard in both the unshifted and shifted states).



Figure 12: The Greek keyboard layout (unshifted)

Figure 13: The Greek keyboard layout (shifted)

Note how most of the letter keys are actually cased versions of each other (also note the light gray keys; those are dead keys). By convention, a letter that has an uppercase version will usually have that version in the shifted state. However, some languages have no notion of case, so they do not need to use the shift state for this purpose.

AltGr shift states

Some languages need more than 96 characters to input their language properly. Using just the shift state is not sufficient, so an additional shift state is added when Control+Alt is pressed. A shortcut to this key combination is to use the Right Alt key, also known as the AltGr key. This behavior is only expected for the keyboard layouts that define characters in the Control+Alt shift state. An example of this is the Polish keyboard layout (see Figures 14-16 for the unshifted, shifted, and AltGr states of this keyboard).

Figure 14: The Polish keyboard layout (unshifted)



Figure 15: The Polish keyboard layout (shifted)

Figure 16: The Polish keyboard layout (Alt+Ctrl or AltGr)

An AltGr+Shift state is also possible; thankfully, few keyboards need this, as users find it difficult to type such characters.
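One way to picture these shift states is as combinations of modifier bits, with AltGr being nothing more than Control and Alt together. The bit values and the tiny layout table below are assumptions made for this sketch (the entries are loosely modeled on a Polish programmer-style layout and are not read from Figures 14-16); the names are hypothetical.

```python
# Assumed modifier bits for the sketch; one bit per shift key.
SHIFT = 1
CTRL = 2
ALT = 4
ALTGR = CTRL | ALT  # Right Alt is a shortcut for Control+Alt

VK_A = 0x41

# Hypothetical layout fragment: (VK, modifier bits) -> character.
LAYOUT = {
    (VK_A, 0): "a",
    (VK_A, SHIFT): "A",
    (VK_A, ALTGR): "\u0105",          # a with ogonek
    (VK_A, ALTGR | SHIFT): "\u0104",  # A with ogonek
}


def char_for(vk: int, modifiers: int):
    """Return the character for a VK in the given shift state, if any."""
    return LAYOUT.get((vk, modifiers))


# Pressing Control+Alt reaches the same state as pressing AltGr.
print(char_for(VK_A, CTRL | ALT) == char_for(VK_A, ALTGR))  # True
```

Because each state is just a different key in the table, a layout that defines nothing for Control+Alt simply has no entries there, which matches the remark that the behavior is only expected for layouts that define that state.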

Control shift states

While it is technically possible to use the Control (CTRL) key as a shift character as well, it is highly discouraged. The reason is that many programs use the CTRL key for various command functions (such as Ctrl+S to mean "Save...") and many times if keystrokes are assigned in the keyboard layout, those keystrokes will not work properly in programs that specifically handle them for other purposes.

Caps Lock key

The caps lock key is usually intended to be a version of the shift key that (a) only shifts characters that are cased versions of each other, and (b) stays shifted without having to hold down the key. On keyboard layouts for languages without a notion of case, the caps lock may do nothing, or it may be used for some other purpose entirely.

SGCap shift states

Some keyboards use the Caps Lock key as an access point for an entirely independent shift state for some of the keys. Originally named for its use in the "Swiss German" keyboard, the SGCaps shift state is also used in the Czech and Hebrew keyboards to allow this extra shift state. Like dead keys, SGCap shift states are either very intuitive if you are used to them or incredibly confusing if you aren't familiar with them. The only real distinction of the SGCap shift states is that the Caps Lock key opens one to two entirely new shift states (an additional 96 characters, between the shifted and unshifted state). Using SGCap shift states in any other keyboards is discouraged unless you want a keyboard layout to have the same feel as one of the keyboards that uses the functionality.

Extended shift states

It is technically possible to add up to three additional keys as "Shift" keys. When combined with all possible combinations of the other shift keys this would allow a total of 55 other shift states. Thankfully this feature is not used in any keyboards to its fullest extent; the Canadian Multilingual Standard keyboard layout is the only one that uses even a single extended shift state.

4. Other technologies and their impact on keyboards

Many other features and functionalities in Windows can influence what is done with the text created by keyboards, depending on the complexity of the writing system. Several of them are listed in this section.

Rendering engines and what they do

The rendering engine has a difficult job. It is tasked with properly displaying complex script text [4] in Windows and any running applications, which is a job made much more difficult by the wide variety of scripts and languages supported on Windows. On versions of Windows prior to Windows 2000, many clues about the language/script came from the HKL ("handle to a keyboard layout", now known as an input locale), since the LOWORD of the HKL is a language ID [5]. This usage has largely been deprecated on the newer versions of Windows [6], which use the far more sophisticated Uniscribe (Microsoft's shaping engine technology) and its various engines that render text based on the writing system of the appropriate language. On downlevel platforms, however, a great deal of information is still obtained from this value.
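Extracting the language ID from an HKL is a matter of masking off the low word. The sketch below treats the handle as a plain integer; the split of a language ID into a 10-bit primary language and a sublanguage in the remaining bits follows the usual Windows convention, and the sample value 0x04090409 (US English) is supplied here only as an illustration. The function names are hypothetical.

```python
def language_id(hkl: int) -> int:
    """Return the LOWORD of an HKL, which is a language ID."""
    return hkl & 0xFFFF


def primary_language(langid: int) -> int:
    """Low 10 bits of a language ID: the primary language."""
    return langid & 0x3FF


def sublanguage(langid: int) -> int:
    """Remaining high bits of a language ID: the sublanguage."""
    return langid >> 10


sample_hkl = 0x04090409  # illustrative value, not taken from the paper
langid = language_id(sample_hkl)
print(hex(langid), hex(primary_language(langid)), sublanguage(langid))
```

In a real program the handle would come from GetKeyboardLayout; only the arithmetic is shown here so that the example stays portable.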

[4] A complex script is any writing system that needs additional processing in order to properly display. For example, Arabic needs contextual shaping as well as bidirectional behavior, Vietnamese needs diacritic positioning, and Indic scripts sometimes need rearrangement of vowel marks. Uniscribe handles this kind of processing.
[5] For more information, see the Platform SDK (http://msdn.microsoft.com/library/en-us/winui/WinUI/WindowsUserInterface/UserInput/KeyboardInput/KeyboardInputReference/KeyboardInputFunctions/GetKeyboardLayout.asp)
[6] This includes any NT-based version of Windows after Windows NT 4 (Windows 2000, Windows XP, and the upcoming Windows .NET Server 2003).



[Figure 17 diagram. Input method: the keyboard layout (Kbdinkan.dll) receives unshifted VK_I followed by unshifted VK_F and produces the code points U+0C97, U+0CBF. The code points go to storage, collation, etc., and to the shaping engine (Uniscribe; language: Kannada; script: Indic; basis of analysis: syllable), which breaks the run into syllables. OpenType Layout Services then performs glyph substitution and glyph positioning, turning code points into glyphs for display.]

Figure 17: The relationship between a keyboard, the rendering engine and display in a complex script (Kannada, an Indic script language).

Fonts

What has diminished the importance of the HKL of a keyboard has been the increased selection of fonts available, as well as font linking (the borrowing of information from multiple fonts to obtain glyphs not in the current font), which was introduced in Windows 2000 and improved for Windows XP. Obviously, for a keyboard to work well, it is assumed that there will be at least one font somewhere on the machine to assist in displaying the input text, lest every character be replaced by a null glyph [7].

IMEs -- when are they preferred?

An Input Method Editor (IME) is a program that allows computer users to enter complex characters and symbols, such as Japanese Kanji characters, by using a standard keyboard. It is a solution to the issue of ideographic languages having tens of thousands of characters, or more. IMEs allow different, alternate means of input for such cases.

Attached to each IME is a keyboard layout. On Windows the convention has always been to attach it to the US English keyboard layout, although some third party IMEs might be attached to other keyboards. The reason that the US English keyboard is usually preferred is that non-Unicode applications using CJK languages would be relying on default system code pages that would not include the text for other languages. Using a US English keyboard simplifies matters.

[7] A null glyph is used when the font is not available on the system, generally in the shape of a box.

For more information on IMEs, see the Platform SDK [8].

Dealing with code pages

Although Windows keyboards are exclusively Unicode, it is important to note that if a keyboard is used with a non-Unicode application, some effort should be made to support this application when possible by choosing characters that fit with the appropriate Windows code page (ACP). Obviously this is not always feasible, since some languages are only supported by Unicode on Windows (e.g., Armenian, Georgian, Hindi, etc.), and thus do not have a system code page.
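Whether a character "fits" a given code page can be checked by attempting to encode it. This is a minimal sketch using Python's built-in Windows code page codecs as a stand-in for the system ACP; the function name is hypothetical, and the sample characters are chosen here for illustration.

```python
def fits_code_page(text: str, codec: str = "cp1252") -> bool:
    """Return True if every character of `text` exists in the given code page."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


print(fits_code_page("\u00e4"))            # a with diaeresis, in cp1252: True
print(fits_code_page("\u0105", "cp1250"))  # a with ogonek, in cp1250: True
print(fits_code_page("\u10d0"))            # Georgian letter an: False
```

A layout designer could run each character assigned to a key through a check of this kind to see which keys a non-Unicode application would be unable to receive.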

Sorting out collation issues

For the most part, collation and keyboards do not have to interact. There is one major area where they can have an impact: many keyboards (both ones from Microsoft and those provided by third parties) are not consistent in their use of composite versus precomposed characters. This can require an extra normalization step if the input is going to be used in XML and other technologies that expect normalized data.
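The extra normalization step can be illustrated with the two ways of entering "a with diaeresis". The sketch assumes Normalization Form C is the form the consumer expects; the paper does not name a specific form, so that choice and the function name are assumptions.

```python
import unicodedata

composite = "a\u0308"   # base letter followed by a combining diaeresis
precomposed = "\u00e4"  # single precomposed code point


def normalize_input(text: str) -> str:
    """Normalize keyboard input to NFC (assumed target form for this sketch)."""
    return unicodedata.normalize("NFC", text)


print(composite == precomposed)                                     # False
print(normalize_input(composite) == normalize_input(precomposed))   # True
```

Before normalization the two strings compare as different even though they display identically, which is the inconsistency a mixed keyboard layout can introduce.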

Collation itself is handled well on Windows, with the proper equivalences between the composite and precomposed forms being an important part of the sorting data kept by the OS [9].

5. Keeping it under the covers

One of the most important features of keyboard layouts under Windows is the seamless behavior: everything discussed in this paper (the USER subsystem, font technology, shaping) is not noticed by the vast majority of the people using the OS. Users simply run setup and choose a language, and everything seems to work. Obviously it is easy for this to not work properly if the user does not know what the content of their keyboard layout is, and their assumptions about what the layout should be turn out to be wrong. It is in fact the user's expectations and assumptions around their keyboard choices that will often lead to the availability of multiple keyboard layout choices for a single language. For example, there are both Divehi Phonetic and Divehi Typewriter keyboard layouts in Windows XP, so that the user wanting to type Divehi text is more likely to find a layout that they prefer.

6. Factors in keyboard layout creation

When developing keyboards for a particular market, a number of factors should be taken into consideration:

Is there some kind of keyboard standard for the region or country? It is sometimes required to have an input method which is sanctioned by the government or an appropriate governing body. Implementers should consider contacting their local or national standards body prior to developing a keyboard. In addition, implementers should consider de facto standards (that is, standards which are not official, but are used by so many people that they are considered standard).

[8] See http://msdn.microsoft.com/library/en-us/intl/ime_5tiq.asp.
[9] For more detailed information on collation on Windows, please see our talk "Sorting it all out: an introduction to collation", available at http://www.microsoft.com/globaldev/Presentations/unicode22/016.doc

What languages will the keyboard support? This should be explicitly determined before allocating keys to characters.

Does the keyboard provide input of all needed linguistic characters for the appropriate language(s)? This requirement can be met in a number of ways: via dead keys or additional shift states, for example (not all characters need to be on the unshifted state). High frequency linguistic characters should be positioned where they are easy to type, ideally in the unshifted state. (Note that if the keyboard supports multiple languages, the high frequency keys may change.)

Does the keyboard focus on code points, and not glyphs? It is important to not place the burden of display or shaping onto the keyboard. All technologies related to visual display are decoupled from the keyboard (and should be handled by fonts and a rendering engine if needed; see section 4 for more information).

Do all characters on the keyboard exist in Unicode? Since all input on Windows is based on Unicode (UTF-16), any code points not encoded in Unicode cannot be handled.

Are supplementary characters (non-BMP characters) encoded in UTF-16 and handled in the ligature section of the keyboard? Is the limit of 2 supplementary characters (4 UTF-16 code units) met on each key?

Ideally, a keyboard should be consistent in its behavior concerning precomposed vs. composite characters.

7. Myths about keyboard layouts

We hear many misconceptions about keyboards and what they can do. This section will hopefully clear up a few of these.

I get the feeling Microsoft just makes up these keyboards by themselves. Why don't they represent my language the way I expect them to?

New keyboards for a market always get tested in their respective market. A great deal of research does go into the keyboards shipped with the system, with feedback from linguists, government officials, other internationalization experts, and local software providers. Unfortunately, it is often a case of Becker's law applying (that is, for each expert, there is an equal and opposite expert).

I don't like the keyboard layout Windows ships for my language; can we remove it or change it?

In an ideal world, customers could customize their keyboard infinitely (and there are some projects out there that will simplify this process, which will be discussed at the presentation), but due to backwards compatibility, we cannot simply remove a keyboard or change keys. There are simply too many customers who count on consistent behavior across releases (even if the behavior is not ideal). In addition, while a customer may not like the keyboard, this may be a national standard for the language, and there may be a requirement to support this particular keyboard. There are a number of other input options to help users input characters not on their keyboards, including:

- Character Map (available from Accessories|System Tools)
- The Insert Symbol dialog (available in Office)
- The ALT+X option, also available in Office. (Typing ALT+X after a character gives you the Unicode value; typing ALT+X after a Unicode value gives you the character.)
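The two directions of the ALT+X conversion amount to moving between a character and its hexadecimal code point value. The sketch below shows only that conversion and makes no claim about how Office implements the feature; the function names are hypothetical.

```python
def char_to_hex(ch: str) -> str:
    """Return the Unicode value of a character as at least four hex digits."""
    return format(ord(ch), "04X")


def hex_to_char(hex_value: str) -> str:
    """Return the character for a hexadecimal Unicode value."""
    return chr(int(hex_value, 16))


print(char_to_hex("\u00e4"))  # 00E4
print(hex_to_char("00E4"))    # ä
```

Applying one function after the other returns the starting value, mirroring how pressing the shortcut twice toggles the text back.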

I want to make sure I have every single visual variant of my characters on the keyboard -- the canonical (or isolate) version of the code point is not sufficient.

As discussed in section 4 (Other technologies and their impact on keyboards), keyboards on Windows only deal with code points, not with glyphs. Code points are used exclusively for text processing, except for display. At the point of display, technologies such as fonts and rendering engines map between code points and glyphs. There is an important technical boundary between code points and glyphs, and this exists in order to maintain at least a modicum of simplicity within the system. (Imagine if every single visual variant of a code point had to be maintained for text processing!) For this reason, keyboards focus exclusively on code points, and leave the work of linking code points to the appropriate visual display to fonts and shaping engines.

I want to have an IME rather than a keyboard for my language.

This is generally heard from customers working with complex script languages who feel that they need to have all visual variants of a code point on an input method. Input Method Editors really make sense with ideographic languages such as Chinese or Korean, where there are literally thousands of characters needed for the language. Each of these ideographic characters is semantically distinct. Compare this with complex scripts, where the number of semantically distinct characters is generally less than 100, but the number of visually distinct characters is considerable (into the hundreds). Again, keyboards work with code points, not with glyphs. Since code points are semantically distinct and not visually distinct, a complex script language can easily be handled via a keyboard; as noted earlier, the code points are linked to the appropriate visual display by other non-keyboard technologies.

8. Summary

As has been described in this paper, the inner workings of keyboards are more complicated than a developer would probably like them to be. What is crucial is understanding the association between the virtual keys, the scan codes and the shift states in a keyboard. In addition, developers should understand the relationship input has to other technologies, once the keyboard passes on the code points (e.g., Uniscribe, font technologies and IMEs). This paper has only touched upon many of the issues, but we hope that it has provided implementers enough knowledge to avoid pitfalls, and provide customers with a seamless input experience.
