This module is provided for information, for readers interested in localization. Some of the information here also helps in understanding conventions used in Windows programming. ANSI C and C++ have been superseded by ISO/IEC standards. Program examples, if any, were compiled using Visual C++ .NET 2003.
A character set is a mapping of characters to their identifying code values (decimal, binary or hex).
Most of the character sets commonly used in computers are single-byte character sets, in which each character is identified by a value one byte wide, such as ASCII (American Standard Code for Information Interchange).
A byte (8 bits) can represent at most 256 (2^8) characters, printable and non-printable. The large number of characters in East Asian languages such as Chinese and Japanese led to the development of multibyte character sets, in particular the double-byte character set (DBCS).
A single byte is simply not enough to represent these scripts. The international standard for encoding characters beyond a single byte is the Unicode Standard.
This is very useful for localizing computer programs, including operating systems.
Unicode is a worldwide character-encoding standard. It is developed and promoted by the Unicode Consortium. The corresponding ISO standard is ISO/IEC 10646 and, at the time this compilation was prepared, the latest version of Unicode was 4.1.0.
Windows NT, Windows 2000, and Windows XP use it exclusively at the system level for character and string manipulation. Unicode simplifies localization of software and improves multilingual text processing.
By implementing it in your applications, you give them universal data exchange capabilities for a global market, using a single binary file for every possible character code.
Unicode defines semantics for each character, standardizes script behavior, provides a standard algorithm for bidirectional text, and defines cross-mappings to other standards.
Among the scripts supported by Unicode are Latin, Greek, Han, Hiragana, and Katakana. Supported languages include, but are not limited to, German, French, English, Greek, Chinese, and Japanese.
Because each Unicode code value is 16 bits wide, it is possible to have separate values for up to 65,536 (2^16) characters. Unicode-enabled functions are often referred to as "wide-character" functions.
Note that the implementation of Unicode in 16-bit values is referred to as UTF-16. For compatibility with 8-bit and 7-bit environments, UTF-8 and UTF-7 are two transformations of 16-bit Unicode values.
Windows supports applications that use either Unicode or the regular ANSI character set. Mixed use in the same application is also possible.
Adding Unicode support to an application is easy, and you can even maintain a single set of sources from which to compile an application that supports either Unicode or the Windows ANSI character set.
Functions support Unicode by assigning its strings a specific data type and providing a separate set of entry points and messages to support this new data type.
A series of macros and naming conventions make transparent migration to Unicode, or even compiling both non-Unicode and Unicode versions of an application from the same set of sources, a straightforward matter.
Implementing Unicode as a separate data type also enables the compiler's type checking to ensure that only Unicode parameters are used with functions expecting Unicode strings.
There is a need to support more characters than the 65,536 that fit in the 16-bit Unicode code space. For example, the Chinese-speaking community alone uses over 55,000 characters.
To overcome this issue, the Unicode Standard defines surrogates. A surrogate or surrogate pair is a pair of 16-bit Unicode code values that represent a single character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF.
The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using surrogates, Unicode can support over one million characters.
Windows 2000 introduces support for basic input, output, and simple sorting of surrogates.
However, not all system components are surrogate compatible. Also, surrogates are not supported in Windows 95/98/Me.
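The surrogate ranges described above can be illustrated with a little arithmetic. The following is a minimal sketch in portable C, not taken from the Windows headers; the function names to_surrogates() and from_surrogates() are hypothetical. A code point above U+FFFF has 0x10000 subtracted from it, which leaves 20 bits: the top 10 bits are added to 0xD800 to form the high surrogate and the bottom 10 bits are added to 0xDC00 to form the low surrogate. The 1,024 x 1,024 combinations account for the "over one million" additional characters.

```c
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
   (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp is outside that range. */
int to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10FFFFu)
        return 0;
    cp -= 0x10000u;                               /* 20 bits remain      */
    *high = (uint16_t)(0xD800u + (cp >> 10));     /* top 10 bits         */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* bottom 10 bits      */
    return 1;
}

/* Hypothetical helper: recombine a valid surrogate pair into the
   code point it represents. The caller is assumed to have checked
   that high is in D800..DBFF and low is in DC00..DFFF. */
uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

For example, U+1D11E splits into the pair 0xD834, 0xDD1E, and recombining that pair gives U+1D11E again.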
Most string operations for Unicode can be written by using the same logic used for handling the Windows ANSI character set, except that the basic unit of operation is a 16-bit character instead of an 8-bit byte, so each string needs more memory.
The Windows header files provide several type definitions that make it easy to create sources that can be compiled for Unicode or the ANSI character set.
Windows API elements that use characters are generally implemented in one of the following three formats:
A generic version that can be compiled for either ANSI or Unicode.
An ANSI version.
A Unicode version.
The following example shows the method used in the Windows header files to define three sets of data types: a set of generic type definitions that can compile for either ANSI or Unicode, and two sets of specific type definitions.
The first set of specific type definitions is for use with the existing Windows (ANSI) character set, and the other is for use with Unicode (or wide) characters.
// generic types
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
typedef TCHAR *LPTSTR, *LPTCH;
// 8-bit character specific
typedef char CHAR;
typedef CHAR *LPSTR, *LPCH;
// Unicode specific (wide characters)
typedef wchar_t WCHAR;
typedef WCHAR *LPWSTR, *LPWCH;
The letter T in a type definition designates a generic type that can be compiled for either ANSI or Unicode.
The letter W in a type definition designates a wide-character (Unicode) type. For the actual implementation of this method, see the winnt.h header file.
An application using generic data types can be compiled for Unicode simply by defining UNICODE before the include statements for the header files, or during compilation.
To compile the code for ANSI, omit the UNICODE definition. As the Windows data type definition section in the previous Module of this Tutorial shows, these data types are already defined properly, so this section is for information only.
It is best to use the generic data types, but the specific types exist for applications that require mixed types.
The standard C run-time libraries contain wide-character versions of the ANSI string functions that begin with the letters str.
The wide-character versions of the functions start with the letters wcs (or sometimes _wcs). The Unicode data type is compatible with the wide-character data type wchar_t in ANSI C. This allows access to the wide-character string functions.
Generic functions exist for all standard C string functions. They start with the letters _tcs and are listed in the tchar.h header file. These functions use the generic data types TCHAR and TCHAR*.
An application must add the following lines to its program in order to use the generic functions and compile for Unicode:
#define _UNICODE
#include <tchar.h>
#include <wchar.h>
Note that both tchar.h and wchar.h are required, and that the leading underscore on the _UNICODE symbol is also required.
The printf() function defined in tchar.h supports the same format specifications as wsprintf(). Similarly, tchar.h contains a wprintf() function, in which the format string itself is a Unicode string.
The formatted output function wsprintf() supports Unicode by providing the following new and changed data types in its format specifications. These format specifications affect the way the wsprintf() function interprets the corresponding passed-in parameter.
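The str/wcs pairing mentioned above can be sketched with the standard C wide-character functions alone. This is only an illustration in portable C, using wcslen() and wcscpy() from wchar.h as the counterparts of strlen() and strcpy(); the helper name copy_wide() is hypothetical, and the capacity is counted in wchar_t elements, not bytes.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copy src into dst, where capacity is the size
   of dst in wchar_t elements. Returns the number of characters copied
   (not counting the terminator), or (size_t)-1 if dst is too small. */
size_t copy_wide(wchar_t *dst, size_t capacity, const wchar_t *src)
{
    size_t len = wcslen(src);      /* wide counterpart of strlen() */
    if (len + 1 > capacity)        /* leave room for the terminator */
        return (size_t)-1;
    wcscpy(dst, src);              /* wide counterpart of strcpy() */
    return len;
}
```

A call such as copy_wide(buf, 8, L"abc") with an 8-element wchar_t buffer copies three characters plus the terminator; the same logic written with strlen() and strcpy() would serve the 8-bit case.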
Format specification | ANSI version | Unicode version |
c | CHAR | WCHAR |
C | WCHAR | CHAR |
hc,hC | CHAR | CHAR |
hs,hS | LPSTR | LPSTR |
lc,lC | WCHAR | WCHAR |
ls, lS | LPWSTR | LPWSTR |
s | LPSTR | LPWSTR |
S | LPWSTR | LPSTR |
Table 1: wsprintf() format specification. |
The data type for the output text always depends on the version of the function. Where the data type of the passed-in parameter and of the output text do not agree, wsprintf() performs a conversion from Unicode to ANSI, or vice versa, as required.
For the Unicode version of wsprintf(), the format string is Unicode, as is the output text.
Like the standard sprintf(), the wsprintf() function formats and stores a series of characters and values in a buffer. Any arguments are converted and copied to the output buffer according to the corresponding format specification in the format string.
The function appends a terminating null character to the characters it writes, but the return value does not include the terminating null character in its character count. The following Table lists the wsprintf() function information.
Item | Description |
Function | wsprintf(). |
Use | Formats and stores a series of characters and values in a buffer for Unicode. |
Prototype | int wsprintf( LPTSTR lpOut, LPCTSTR lpFmt, ... ); |
Parameters | lpOut - [out] Pointer to a buffer to receive the formatted output. The maximum size of the buffer is 1024 bytes.
lpFmt - [in] Pointer to a null-terminated string that contains the format-control specifications. In addition to ordinary ASCII characters, a format specification for each argument appears in this string.
... - [in] Specifies one or more optional arguments. The number and type of argument parameters depend on the corresponding format-control specifications in the lpFmt parameter. |
Return value | If the function succeeds, the return value is the number of characters stored in the output buffer, not counting the terminating null character. If the function fails, the return value is less than the length of the expected output. To get extended error information, call GetLastError(). |
Include file | <windows.h> |
Table 2: wsprintf() function information. |
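The format-control string is read from left to right. The first format specification causes the first argument after the format string to be converted and copied to the output buffer. The second format specification causes the second argument to be converted and copied, and so on. If there are more arguments than format specifications, the extra arguments are ignored. If there are not enough arguments for all of the format specifications, the results are undefined.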
A format specification has the following form:
%[-][#][0][width][.precision]type
Each field is a single character or a number signifying a particular format option. The type characters that appear after the last optional format field determine whether the associated argument is interpreted as a character, a string, or a number.
The simplest format specification contains only the percent sign and a type character (for example, %s). The optional fields control other aspects of the formatting.
The following Table lists the optional and required fields and their meanings.
Field | Meaning |
- | Pad the output with blanks or zeros to the right to fill the field width, justifying output to the left. If this field is omitted, the output is padded to the left, justifying it to the right. |
# | Prefix hexadecimal values with 0x (lowercase) or 0X (uppercase). |
0 | Pad the output value with zeros to fill the field width. If this field is omitted, the output value is padded with blank spaces. |
width | Copy the specified minimum number of characters to the output buffer. The width field is a nonnegative integer. The width specification never causes a value to be truncated; if the number of characters in the output value is greater than the specified width, or if the width field is not present, all characters of the value are printed, subject to the precision specification. |
.precision | For numbers, copy the specified minimum number of digits to the output buffer. If the number of digits in the argument is less than the specified precision, the output value is padded on the left with zeros. The value is not truncated when the number of digits exceeds the specified precision. If the specified precision is 0 or omitted entirely, or if the period (.) appears without a number following it, the precision is set to 1. For strings, copy the specified maximum number of characters to the output buffer. |
type | Output the corresponding argument as a character, a string, or a number. The character and string type values are listed in Table 1; numeric type values include d, i, u, x and X. |
Table 3: Optional and required field for wsprintf(). |
As you may have noticed, the explanation in this section is quite similar to that of other formatted output functions such as printf(), except for the format specifications used and the Unicode support.
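The optional fields can be seen in action with a short sketch. Note the assumption: this example uses the standard C snprintf() instead of wsprintf(), because the -, #, 0, width and .precision fields shown here behave the same way in both and the standard function runs on any platform. The function name format_demo() is hypothetical.

```c
#include <stdio.h>

/* Hypothetical demo: writes one line that exercises each optional
   field of a format specification. Uses standard snprintf() as a
   stand-in for wsprintf(); returns what snprintf() returns. */
int format_demo(char *out, size_t size)
{
    return snprintf(out, size,
                    "[%-5d]"   /* -     : left-justify in a width of 5     */
                    "[%05d]"   /* 0     : pad with zeros to a width of 5   */
                    "[%#x]"    /* #     : prefix the hex value with 0x     */
                    "[%.3d]"   /* .prec : at least 3 digits for a number   */
                    "[%.2s]",  /* .prec : at most 2 characters of a string */
                    42, 42, 255, 7, "abcdef");
}
```

With a sufficiently large buffer, out receives the text [42   ][00042][0xff][007][ab].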
Win32 also provides function prototypes in generic, ANSI, and Unicode versions. The generic function prototypes can be compiled to produce either ANSI or Unicode prototypes.
As an example, all three prototypes are shown in the following code sample for the SetWindowText() function.
BOOL SetWindowText(HWND hwnd, LPCTSTR lpText);
The header file provides the generic function name implemented as a macro:
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE
The preprocessor expands the macro into either the ANSI or Unicode function names, depending on whether UNICODE is defined. The letter A (ANSI) or W (wide) is added at the end of the generic function name, as appropriate. The header file then provides ANSI and Unicode function prototypes, as shown in the following examples.
BOOL SetWindowTextA(HWND hwnd, LPCSTR lpText);
BOOL SetWindowTextW(HWND hwnd, LPCWSTR lpText);
Note that the generic function prototype uses the generic LPCTSTR for the text parameter, but the ANSI prototype uses LPCSTR, and the Unicode prototype uses LPCWSTR.
You can call the generic function in your application, and then define UNICODE when you compile the code to use the Unicode function.
To default to the ANSI function, do not define UNICODE. You can mix function calls by using the explicit function names ending with A and W.
This approach applies to all functions with text arguments. Always use a generic function prototype with the generic string and character types.
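The macro dispatch described in this section can be imitated in portable C to see how one source line resolves to either version. Everything named here (DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT, TextLengthA, TextLengthW, TextLength, greeting_length) is hypothetical and stands in for the real UNICODE symbol, TCHAR type and A/W function pairs; this is a sketch of the pattern, not code from the Windows headers.

```c
#include <string.h>
#include <wchar.h>

/* Two explicit versions, in the style of the A and W functions. */
size_t TextLengthA(const char *s)    { return strlen(s); }
size_t TextLengthW(const wchar_t *s) { return wcslen(s); }

/* The generic name, character type and literal macro all switch
   together on one symbol, as UNICODE does in the Windows headers. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT(s) L##s
#define TextLength   TextLengthW
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT(s) s
#define TextLength   TextLengthA
#endif

/* Generic code: compiles unchanged with or without DEMO_UNICODE. */
size_t greeting_length(void)
{
    const DEMO_TCHAR *greeting = DEMO_TEXT("hello");
    return TextLength(greeting);
}
```

Compiling with -DDEMO_UNICODE (or the equivalent compiler option) selects the wide version; the explicit TextLengthA() and TextLengthW() names remain callable in either build, which is how mixed use works.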
All function names that end with an uppercase W take wide-character arguments. Some functions exist only in wide-character versions and can be used only with the appropriate data types.
The Requirements section in the MSDN documentation for each function provides information on the function versions implemented by the system. If there is a line that begins with Unicode, the function has separate Unicode and ANSI versions.
Whenever a function has a length parameter for a character string, the length should be documented as a count of TCHAR values in the string.
This refers to bytes for ANSI versions of the function or characters for Unicode versions. However, functions that require or return pointers to untyped memory blocks, such as the GlobalAlloc() function, are exceptions.
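The difference between a character count and a byte count is easy to get wrong, so a small sketch may help. It is written in portable C with wchar_t standing in for a wide TCHAR; the macro ELEMENT_COUNT and the function names are hypothetical. The element count is what a length parameter expects, while the byte count is what an untyped allocator expects.

```c
#include <stddef.h>
#include <wchar.h>

/* Number of elements in an array (not valid for pointers). */
#define ELEMENT_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Length to pass where a count of characters is expected. */
size_t wide_buffer_chars(void)
{
    wchar_t buffer[32];
    return ELEMENT_COUNT(buffer);
}

/* Size to pass where an untyped block is measured in bytes. */
size_t bytes_for_wide_chars(size_t cch)
{
    return cch * sizeof(wchar_t);
}
```

Passing sizeof(buffer) where a character count is expected would overstate the capacity of a wide buffer by a factor of sizeof(wchar_t).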
The MultiByteToWideChar() function can be used to map a character string to a wide-character (Unicode) string. The character string mapped by this function is not necessarily from a multibyte character set.
The information for this function is listed in the following Table.
Item | Description |
Function | MultiByteToWideChar(). |
Use | Maps a character string to a wide-character (Unicode) string. The character string mapped by this function is not necessarily from a multibyte character set. |
Prototype | int MultiByteToWideChar(
    UINT CodePage,            // code page
    DWORD dwFlags,            // character-type options
    LPCSTR lpMultiByteStr,    // string to map
    int cbMultiByte,          // number of bytes in string
    LPWSTR lpWideCharStr,     // wide-character buffer
    int cchWideChar);         // size of buffer |
Parameters | CodePage - [in] Specifies the code page to be used to perform the conversion. This parameter can be given the value of any code page that is installed or available in the system. You can also specify one of the values listed below:
dwFlags - [in] Indicates whether to translate to pre composed or composite-wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to deal with invalid characters. You can specify a combination of the following flag constants.
A composite character consists of a base character and a non-spacing character, each having different character values. A pre-composed character has a single character value for a base/non-spacing character combination. In the character è, the e is the base character and the accent grave mark is the non-spacing character. The function's default behavior is to translate to the pre-composed form. If a pre-composed form does not exist, the function attempts to translate to a composite form. The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive. The MB_USEGLYPHCHARS flag and the MB_ERR_INVALID_CHARS flag can be set regardless of the state of the other flags. For the following code pages, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS.
Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by code page 65001 (UTF-8).
lpMultiByteStr - [in] Points to the character string to be converted.
cbMultiByte - [in] Specifies the size, in bytes, of the string pointed to by the lpMultiByteStr parameter, or -1 if the string is null-terminated. If cbMultiByte is 0, the function fails. If this parameter is -1, the function processes the entire input string including the null terminator; the resulting wide-character string therefore has a null terminator, and the returned length includes it. If this parameter is a positive integer, the function processes exactly the specified number of bytes; if the given length does not include a null terminator, the resulting wide-character string is not null-terminated and the returned length does not include a null terminator.
lpWideCharStr - [out] Points to a buffer that receives the translated string.
cchWideChar - [in] Specifies the size, in wide characters, of the buffer pointed to by the lpWideCharStr parameter. If this value is zero, the function returns the required buffer size, in wide characters, and makes no use of the lpWideCharStr buffer. |
Return value | If the function succeeds and cchWideChar is nonzero, the return value is the number of wide characters written to the buffer pointed to by lpWideCharStr. If the function succeeds and cchWideChar is zero, the return value is the required size, in wide characters, for a buffer that can receive the translated string. If dwFlags equals zero, the input string is UTF-8 and it contains invalid characters, the function returns ERROR_NO_UNICODE_TRANSLATION. If the function fails, the return value is zero. To get extended error information, call GetLastError(). GetLastError() may return one of the following error codes:
|
Include file | <windows.h> |
Table 4: MultiByteToWideChar() information. |
Using the MultiByteToWideChar() function incorrectly can compromise the security of your application. Calling the MultiByteToWideChar() function can easily cause a buffer overrun because the size of the input buffer is given as the number of bytes in the string, while the size of the output buffer is given as the number of WCHARs.
To avoid a buffer overrun, be sure to specify a buffer size appropriate for the data type the buffer receives.
The lpMultiByteStr and lpWideCharStr pointers must not be the same. If they are the same, the function fails, and GetLastError() returns the value ERROR_INVALID_PARAMETER.
The function fails if MB_ERR_INVALID_CHARS is set and it encounters an invalid character in the source string. An invalid character is either:
A character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or
For DBCS strings, a character that has a lead byte but no valid trailing byte.
When an invalid character is found and MB_ERR_INVALID_CHARS is set, the function returns 0 and GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.
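The buffer-size advice above is usually followed by calling the function twice: once with cchWideChar set to zero to ask for the required size, and once to convert. The following is a minimal sketch of that pattern, not a definitive implementation. The helper name to_wide() is hypothetical, CP_ACP (the system ANSI code page) is an assumed choice of code page, and the #else branch is an assumption added so the sketch also builds off Windows, where it substitutes the standard C mbstowcs() and relies on the common behavior that a NULL destination returns the required length.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns a newly allocated wide-character copy
   of the null-terminated string src, or NULL on failure.
   The caller releases the result with free(). */
wchar_t *to_wide(const char *src)
{
    wchar_t *dst;
#ifdef _WIN32
    /* First call: cchWideChar is 0, so the return value is the required
       size in wide characters. Because cbMultiByte is -1, that size
       includes the null terminator. */
    int needed = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (needed <= 0)
        return NULL;
    /* Allocate in bytes: count of WCHARs times the size of one WCHAR. */
    dst = malloc((size_t)needed * sizeof *dst);
    if (dst == NULL)
        return NULL;
    /* Second call: the buffer size is passed as a count of WCHARs. */
    if (MultiByteToWideChar(CP_ACP, 0, src, -1, dst, needed) == 0) {
        free(dst);
        return NULL;
    }
#else
    /* Assumed portable stand-in: same query-then-convert pattern. */
    size_t n = mbstowcs(NULL, src, 0);   /* length without terminator */
    if (n == (size_t)-1)
        return NULL;
    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;
    if (mbstowcs(dst, src, n + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
#endif
    return dst;
}
```

The point of the design is that the byte count (for malloc) and the character count (for the conversion call) are computed from the same number and never mixed up.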
The WideCharToMultiByte() function can be used to map a wide-character string to a new character string. The new character string is not necessarily from a multibyte character set.
The following Table lists the information for WideCharToMultiByte().
Item | Description |
Function | WideCharToMultiByte(). |
Use | Maps a wide-character string to a new character string. The new character string is not necessarily from a multibyte character set. |
Prototype | int WideCharToMultiByte(
    UINT CodePage,              // code page
    DWORD dwFlags,              // performance and mapping flags
    LPCWSTR lpWideCharStr,      // wide-character string
    int cchWideChar,            // number of chars in string
    LPSTR lpMultiByteStr,       // buffer for new string
    int cbMultiByte,            // size of buffer
    LPCSTR lpDefaultChar,       // default for unmappable chars
    LPBOOL lpUsedDefaultChar);  // set when default char used |
Parameters | CodePage - [in] Specifies the code page used to perform the conversion. This parameter can be given the value of any code page that is installed or available in the system. For a list of code pages, check MSDN documentation for Code Page Identifiers provided at the end of this Module. You can also specify one of the following values:
dwFlags - [in] Specifies the handling of unmapped characters. The function performs more quickly when none of these flags is set. The following flag constants are defined.
When WC_COMPOSITECHECK is specified, the function converts composite characters to pre-composed characters. A composite character consists of a base character and a non-spacing character, each having different character values. A pre-composed character has a single character value for a base/non-spacing character combination. In the character è, the e is the base character, and the accent grave mark is the non-spacing character. When an application specifies WC_COMPOSITECHECK, it can use the last three flags in this list (WC_DISCARDNS, WC_SEPCHARS, and WC_DEFAULTCHAR) to customize the conversion to pre-composed characters. These flags determine the function's behavior when there is no pre-composed mapping for a base/non-spacing character combination in a wide-character string. These last three flags can only be used if the WC_COMPOSITECHECK flag is set. The function's default behavior is to generate separate characters (WC_SEPCHARS) for unmapped composite characters. For the following code pages, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS.
lpWideCharStr - [in] Points to the wide-character string to be converted.
cchWideChar - [in] Specifies the number of wide characters in the string pointed to by the lpWideCharStr parameter. If this value is -1, the string is assumed to be null-terminated and the length is calculated automatically. The length will include the null terminator. Note that if cchWideChar is zero, the function fails.
lpMultiByteStr - [out] Points to the buffer to receive the translated string.
cbMultiByte - [in] Specifies the size, in bytes, of the buffer pointed to by the lpMultiByteStr parameter. If this value is zero, the function returns the number of bytes required for the buffer. In this case, the lpMultiByteStr buffer is not used.
lpDefaultChar - [in] Points to the character used if a wide character cannot be represented in the specified code page. If this parameter is NULL, a system default value is used. The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL. For the code pages mentioned in dwFlags, lpDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER.
lpUsedDefaultChar - [out] Points to a flag that indicates whether a default character was used. The flag is set to TRUE if one or more wide characters in the source string cannot be represented in the specified code page. Otherwise, the flag is set to FALSE. This parameter may be NULL. The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL. For the code pages mentioned in dwFlags, lpUsedDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER. |
Return value | If the function succeeds and cbMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr. The number includes the byte for the null terminator. If the function succeeds and cbMultiByte is zero, the return value is the required size, in bytes, for a buffer that can receive the translated string. If the function fails, the return value is zero. To get extended error information, call GetLastError(). GetLastError() may return one of the following error codes:
|
Include file | <windows.h> |
Table 5: WideCharToMultiByte() information. |
Using the WideCharToMultiByte() function incorrectly can compromise the security of your application. Calling the WideCharToMultiByte() function can easily cause a buffer overrun because the size of the input buffer is given as the number of WCHARs in the string, while the size of the output buffer is given as the number of bytes.
To avoid a buffer overrun, be sure to specify a buffer size appropriate for the data type the buffer receives. Visual C++ .NET also provides a buffer overrun check during compilation.
For strings that require validation, such as file, resource and user names, always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte(). This flag prevents the function from mapping characters to characters that appear similar but have very different semantics.
In some cases, the semantic change can be extreme; for example, the symbol '∞' (infinity) maps to 8 (eight) in some code pages.
WC_NO_BEST_FIT_CHARS is not available on Windows 95 and Windows NT 4.0. If your code must run on these platforms, you can achieve the same effect by round-tripping the string using MultiByteToWideChar(). Any code point that does not round-trip is a best-fit character.
The lpMultiByteStr and lpWideCharStr pointers must not be the same. If they are the same, the function fails, and GetLastError() returns ERROR_INVALID_PARAMETER.
If CodePage is CP_SYMBOL and cbMultiByte is less than cchWideChar, no characters are written to lpMultiByteStr. Otherwise, if cbMultiByte is less than cchWideChar, cbMultiByte characters are copied to the buffer pointed to by lpMultiByteStr.
An application can use the lpDefaultChar parameter to change the default character used for the conversion.
As noted earlier, the WideCharToMultiByte() function operates most efficiently when both lpDefaultChar and lpUsedDefaultChar are NULL. The following table shows the behavior of WideCharToMultiByte() for the four combinations of lpDefaultChar and lpUsedDefaultChar.
lpDefaultChar | lpUsedDefaultChar | Result |
NULL | NULL | No default checking. This is the most efficient way to use this function. |
non-NULL | NULL | Uses the specified default character, but does not set lpUsedDefaultChar. |
NULL | non-NULL | Uses the system default character and sets lpUsedDefaultChar if necessary. |
non-NULL | non-NULL | Uses the specified default character and sets lpUsedDefaultChar if necessary. |
Table 6: lpDefaultChar and lpUsedDefaultChar combination behaviors. |
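The "NULL, non-NULL" row of the table, combined with the WC_NO_BEST_FIT_CHARS advice, can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the helper name to_narrow() is hypothetical, CP_ACP is an assumed choice of code page, and the #else branch is an assumption added so the sketch also builds off Windows, where it substitutes the standard C wcstombs() and simply treats any unconvertible character as a failure.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts the null-terminated wide string src
   into dst, whose size is dst_size bytes. Returns 1 on success and 0
   on failure. *lossy is set to 1 when at least one character could
   not be represented, otherwise to 0. */
int to_narrow(const wchar_t *src, char *dst, size_t dst_size, int *lossy)
{
    *lossy = 0;
#ifdef _WIN32
    {
        BOOL used_default = FALSE;
        /* lpDefaultChar is NULL (system default character is used),
           lpUsedDefaultChar is non-NULL (the flag is reported back).
           The buffer size is passed in bytes. */
        int written = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                                          src, -1, dst, (int)dst_size,
                                          NULL, &used_default);
        if (written == 0)
            return 0;
        *lossy = used_default ? 1 : 0;
        return 1;
    }
#else
    {
        /* Assumed portable stand-in for the same call. */
        size_t n = wcstombs(dst, src, dst_size);
        if (n == (size_t)-1) {      /* unconvertible character */
            *lossy = 1;
            return 0;
        }
        if (n == dst_size)          /* no room left for the terminator */
            return 0;
        return 1;
    }
#endif
}
```

A caller validating a file or user name would typically reject the input whenever the lossy flag comes back set, since the converted string no longer matches what was typed.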
Further reading and digging:
Microsoft C references, online MSDN.
Microsoft Visual C++, online MSDN.
ReactOS - Windows binary compatible OS - C/C++ source code repository, Doxygen.
Structure, enum, union and typedef story can be found at C/C++ struct, enum, union & typedef.
Linux Access Control Lists (ACL) info can be found at Access Control Lists.
For Unicode and character set reference that contains functions, structures, macros and constants: Unicode and character set reference (MSDN).
Notation used in MSDN is Hungarian Notation instead of CamelCase and is discussed in Windows programming notations.
Windows data type information is in Windows data types used in Win32 programming.