
MODULE G

CHARACTER SETS

ANSI, Multibyte, Unicode and Localization 1

What is in this Module?

Programming Abilities:
A Story of Character Sets
Single-byte Character Sets
Double-byte Character Sets
Unicode
Windows Issue
Surrogates
Windows Data Types for Strings
Standard C Functions
wsprintf() Function
Conventions for Function Prototypes – Win32 Programming
Generic Prototype
ANSI Prototype
Unicode Prototype
Character and Unicode strings Conversion

My Training Period: xx hours. This Module is provided for information only, for readers who are interested in localization. Some of the information here also helps in understanding conventions used in Windows programming. ANSI C and C++ have been superseded by the ISO/IEC standards. Program examples, if any, were compiled using Visual C++ .Net 2003.

The programming abilities for this session:

Able to understand the variety of the character sets such as ANSI, Multibyte and Unicode.
Able to find and collect information related to functions used for character sets.
Able to understand and use the collected information about the functions in programs.
Able to understand the localization (locale).

A Story of Character Sets

A character set is a mapping of characters to their identifying code values (decimal, binary or hex).
Most of the character sets commonly used in computers are single-byte character sets, in which each character is identified by a value one byte wide. ASCII (American Standard Code for Information Interchange) is the best-known example.
A byte (8 bits) can represent at most 256 (2⁸) characters, printable and non-printable. The large number of characters in Asian scripts such as Chinese, Japanese and Thai led to the development of multibyte character sets, in particular the double-byte character set (DBCS).
A single byte is simply not enough to represent these scripts. The international standard for encoding characters beyond the single-byte range is the Unicode Standard.
This is very useful for localization of computer programs, including operating systems.

Single-byte Character Sets

A single-byte character set is a mapping of 256 individual characters to their identifying code values. The code values 0x20 (decimal 32) through 0x7E (decimal 126) represent standardized displayable characters, but the characters represented by the remaining codes vary among character sets.
The ASCII character set covers the range 0x00 through 0x7F.
The ANSI character set is used in the window manager (User) and graphics device interface (GDI), but the Microsoft MS-DOS file allocation table (FAT) file system uses the original equipment manufacturer (OEM) character set.
Variations on the character sets, called code pages, include different special characters, typically customized for a language or group of languages. The OEM code page 437 is generally used in the United States.
Applications can use Unicode to avoid the inconsistencies of varied code pages and as an aid in developing easily localized applications.
For Windows programming, an application can use the GetACP() function to retrieve the ANSI code-page identifier for the system, or use the GetOEMCP() function to retrieve the OEM code-page identifier.
The OemToChar() and OemToCharBuff() functions allow an application to convert a character or string from the OEM code page to either the ANSI code page or Unicode.
To convert in the other direction, you can use either the CharToOem() or CharToOemBuff() function. In addition, an application can use the MultiByteToWideChar() and WideCharToMultiByte() functions to map single-byte character set (SBCS) strings to Unicode and map Unicode strings to SBCS strings.
The GetCPInfo() function fills a CPINFO structure with information that includes the size, in bytes, of the largest character in the code page and the default character used when a character code is entered that has no corresponding entry in the code page.
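
To make the idea of code pages concrete, the following minimal sketch shows how one byte value can stand for different characters under two code pages. The two-entry table excerpts and all function names are illustrative assumptions, not part of any Windows API; a real program would normally rely on MultiByteToWideChar() with the appropriate code page.

```c
#include <stddef.h>

/* One table entry: a byte value and the Unicode code point it maps to. */
typedef struct {
    unsigned char byte;
    unsigned int  code_point;
} CodePageEntry;

/* Hypothetical, deliberately tiny excerpts of two real code pages. */
static const CodePageEntry oem_437[]   = { { 0x82, 0x00E9 }, { 0xE9, 0x0398 } };
static const CodePageEntry ansi_1252[] = { { 0x82, 0x201A }, { 0xE9, 0x00E9 } };

/* Look up a byte in a table excerpt; return 0 if the excerpt lacks it. */
static unsigned int lookup(const CodePageEntry *table, size_t count,
                           unsigned char byte)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (table[i].byte == byte)
            return table[i].code_point;
    }
    return 0;
}

/* Byte in OEM code page 437 -> Unicode code point (excerpt only). */
unsigned int oem437_to_unicode(unsigned char byte)
{
    return lookup(oem_437, sizeof oem_437 / sizeof oem_437[0], byte);
}

/* Byte in ANSI code page 1252 -> Unicode code point (excerpt only). */
unsigned int ansi1252_to_unicode(unsigned char byte)
{
    return lookup(ansi_1252, sizeof ansi_1252 / sizeof ansi_1252[0], byte);
}
```

In this excerpt the byte 0x82 is the letter é (U+00E9) in code page 437 but a low quotation mark (U+201A) in code page 1252, which is why text must always be interpreted with the code page it was written in.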

Double-byte Character Sets

The double-byte character set (DBCS) is called an expanded 8-bit character set because its smallest unit is a byte.
Some characters in a DBCS have a single byte code value and some have a double byte code value. A DBCS can be thought of as the ANSI character set for some Asian versions of Microsoft Windows (particularly the Japanese versions). Functions on the Japanese versions of Windows accept DBCS strings for the ANSI versions of the functions.
However, unlike the handling of Unicode, DBCS character handling requires detailed changes in the character-processing algorithms throughout an application's source code.
To help identify double-byte character sets, an application can use the IsDBCSLeadByte() function to determine whether a given character is the first byte in a 2-byte character.
In addition, an application can use the MultiByteToWideChar() and WideCharToMultiByte() functions to map DBCS strings to Unicode and map Unicode strings to DBCS strings.
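
As a rough illustration of why DBCS handling touches character-processing code, the sketch below counts characters rather than bytes. It assumes the lead-byte ranges of code page 932 (Shift-JIS) and uses a hand-written helper, is_lead_byte_cp932(), in place of IsDBCSLeadByte(); both function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: lead-byte ranges of code page 932 (Shift-JIS).
   On Windows, IsDBCSLeadByte() answers this for the system code page. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a null-terminated DBCS string. */
size_t dbcs_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != '\0') {
        /* A lead byte followed by a trail byte forms one character.
           A dangling lead byte at the end is counted as one character. */
        if (is_lead_byte_cp932(*p) && p[1] != '\0')
            p += 2;
        else
            p += 1;
        ++count;
    }
    return count;
}
```

For the four-byte string "A", 0x82 0xA0, "B" the function reports three characters, whereas strlen() would report four.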

Unicode

Unicode is a worldwide character-encoding standard. The Unicode Consortium develops the standard and promotes its use. The corresponding ISO standard is ISO/IEC 10646, and at the time this compilation was prepared the latest Unicode version was 4.1.0.
Windows NT, Windows 2000, and Windows XP use it exclusively at the system level for character and string manipulation. Unicode simplifies localization of software and improves multilingual text processing.
By implementing it in your applications, you can enable the application with universal data exchange capabilities for global marketing, using a single binary file for every possible character code.
Unicode defines semantics for each character, standardizes script behavior, provides a standard algorithm for bidirectional text, and defines cross-mappings to other standards.
Scripts supported by Unicode include Latin, Greek, Han, Hiragana, and Katakana; supported languages include German, French, English, Greek, Chinese, and Japanese, among many others.
Because each Unicode code value is 16 bits wide, it is possible to have separate values for up to 65,536 (2¹⁶) characters. Unicode-enabled functions are often referred to as "wide-character" functions.
Note that the implementation of Unicode in 16-bit values is referred to as UTF-16. For compatibility with 8-bit and 7-bit environments, UTF-8 and UTF-7 are two transformations of 16-bit Unicode values.
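
The relationship between a 16-bit Unicode value and its UTF-8 form can be sketched as follows. This is a minimal illustration, not a complete converter: the function name is hypothetical, and surrogate values are assumed to have been combined by the caller beforehand.

```c
#include <stddef.h>

/* Encode one 16-bit Unicode value (U+0000..U+FFFF, surrogates excluded)
   as UTF-8. Writes 1 to 3 bytes to out and returns how many were written. */
size_t utf16_unit_to_utf8(unsigned int unit, unsigned char out[3])
{
    if (unit < 0x80) {              /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {             /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | ((unit >> 12) & 0x0F));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}
```

ASCII values keep their single-byte form, which is what makes UTF-8 convenient in byte-oriented environments.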

Windows Issue

Windows supports applications that use either Unicode or the regular ANSI character set. Mixed use in the same application is also possible.
Adding Unicode support to an application is easy, and you can even maintain a single set of sources from which to compile an application that supports either Unicode or the Windows ANSI character set.
Functions support Unicode by assigning its strings a specific data type and providing a separate set of entry points and messages to support this new data type.
A series of macros and naming conventions make transparent migration to Unicode, or even compiling both non-Unicode and Unicode versions of an application from the same set of sources, a straightforward matter.
Implementing Unicode as a separate data type also enables the compiler's type checking to ensure that only Unicode parameters are used with functions expecting Unicode strings.

Surrogates

There is a need to support more characters than the 65,536 that fit in the 16-bit Unicode code space. For example, the Chinese speaking community alone uses over 55,000 characters.
To overcome this issue, the Unicode Standard defines surrogates. A surrogate or surrogate pair is a pair of 16-bit Unicode code values that represent a single character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF.
The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using surrogates, Unicode can support over one million characters.
Windows 2000 introduces support for basic input, output, and simple sorting of surrogates.
However, not all system components are surrogate compatible. Also, surrogates are not supported in Windows 95/98/Me.
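
The arithmetic behind surrogate pairs can be sketched in a few lines. The function names are hypothetical; the ranges are the ones given above. Each surrogate carries 10 bits, so a pair addresses 1,024 x 1,024 = 1,048,576 additional characters starting at U+10000.

```c
/* Split a supplementary code point (U+10000..U+10FFFF) into a
   UTF-16 surrogate pair. Returns 1 on success, 0 if out of range. */
int to_surrogates(unsigned long cp, unsigned int *high, unsigned int *low)
{
    if (cp < 0x10000UL || cp > 0x10FFFFUL)
        return 0;
    cp -= 0x10000UL;                                /* 20 bits remain */
    *high = (unsigned int)(0xD800 + (cp >> 10));    /* top 10 bits    */
    *low  = (unsigned int)(0xDC00 + (cp & 0x3FF));  /* low 10 bits    */
    return 1;
}

/* Combine a valid surrogate pair back into a code point. */
unsigned long from_surrogates(unsigned int high, unsigned int low)
{
    return 0x10000UL + (((unsigned long)(high - 0xD800)) << 10)
                     + (unsigned long)(low - 0xDC00);
}
```

For example, U+1D11E splits into the pair U+D834, U+DD1E, and combining that pair yields U+1D11E again.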

Windows Data Types for Strings

Most string operations for Unicode can be written by using the same logic used for handling the Windows ANSI character set, except that the basic unit of operation is a 16-bit character instead of an 8-bit byte, so the same text needs more memory.
The Windows header files provide several type definitions that make it easy to create sources that can be compiled for Unicode or the ANSI character set.
Windows API elements that use characters are generally implemented in one of the following three formats:

A generic version that can be compiled for either ANSI or Unicode.

An ANSI version.

A Unicode version.

The following example shows the method used in the Windows header files to define three sets of data types: a set of generic type definitions that can compile for either ANSI or Unicode, and two sets of specific type definitions.
The first set of specific type definitions is for use with the existing Windows (ANSI) character set, and the other is for use with Unicode (or wide) characters.

// generic types
#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif
typedef TCHAR *LPTSTR, *LPTCH;

// 8-bit character specific
typedef char CHAR;
typedef CHAR *LPSTR, *LPCH;

// Unicode specific (wide characters)
typedef wchar_t WCHAR;
typedef WCHAR *LPWSTR, *LPWCH;

The letter T in a type definition designates a generic type that can be compiled for either ANSI or Unicode.
The letter W in a type definition designates a wide-character (Unicode) type. For the actual implementation of this method, see the winnt.h header file.
An application using generic data types can be compiled for Unicode simply by defining UNICODE before the include statements for the header files, or during compilation.
To compile the code for ANSI, omit the UNICODE definition. As the Windows data type section in the previous Module of this Tutorial shows, these data types are already defined properly in the Windows headers, so this section is for information only.
It is best to use the generic data types, but the specific types exist for applications that require mixed types.
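
The mechanism can be tried outside the Windows headers with a small sketch. MY_TCHAR and MY_TEXT below are hypothetical stand-ins for TCHAR and the TEXT() macro, renamed so that they cannot clash with the real definitions; the two helper functions are likewise only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for TCHAR and the TEXT() macro. */
#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT(s) L##s
#else
typedef char MY_TCHAR;
#define MY_TEXT(s) s
#endif

/* Length in characters of a generic string, whichever type is active. */
size_t my_tcslen(const MY_TCHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* Size in bytes of the same string, excluding the terminator. */
size_t my_tcsbytes(const MY_TCHAR *s)
{
    return my_tcslen(s) * sizeof(MY_TCHAR);
}
```

Compiling the same source with and without UNICODE defined leaves the character count of MY_TEXT("hello") at 5 in both builds, while the byte count changes with the size of the character type.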

Standard C Functions

The standard C run-time libraries contain wide-character versions of the ANSI string functions that begin with the letters str.
The wide-character versions of the functions start with the letters wcs (or sometimes _wcs). The Unicode data type is compatible with the wide-character data type wchar_t in ANSI C. This allows access to the wide-character string functions.
Generic functions exist for all standard C string functions. They start with the letters _tcs and are listed in the tchar.h header file. These functions use the generic data types TCHAR and TCHAR*.
An application must add the following lines to its program in order to use the generic functions and compile for Unicode:

#define _UNICODE
#include <tchar.h>
#include <wchar.h>

Note that both tchar.h and wchar.h are required, and that the leading underscore on the _UNICODE macro is also required.
The printf() function in the Microsoft C run-time library supports the same character and string format specifications as wsprintf(). Similarly, wchar.h declares a wprintf() function, in which the format string itself is a Unicode string; tchar.h maps the generic name _tprintf() to one or the other.
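
The parallel between the str and wcs families can be seen in a short portable sketch. It uses only the standard strlen(), wcslen() and wcsncpy() functions; the wrapper names are hypothetical. Note that capacities and lengths of wide strings are counted in characters, not bytes.

```c
#include <string.h>
#include <wchar.h>

/* Copy src into dst (capacity counted in characters, not bytes),
   always terminating. Returns the number of characters stored. */
size_t copy_wide(wchar_t *dst, size_t capacity, const wchar_t *src)
{
    if (capacity == 0)
        return 0;
    wcsncpy(dst, src, capacity - 1);
    dst[capacity - 1] = L'\0';
    return wcslen(dst);
}

/* Bytes occupied by a narrow string, terminator included. */
size_t narrow_bytes(const char *s)
{
    return strlen(s) + 1;
}

/* Bytes occupied by a wide string, terminator included. */
size_t wide_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

The size of wchar_t is 16 bits with the Microsoft compilers discussed here, but other platforms may use a different width, so the byte counts should always be computed with sizeof rather than assumed.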

wsprintf() Function

The formatted output function wsprintf() supports Unicode by providing the following new and changed data types in its format specifications. These format specifications affect the way the wsprintf() function interprets the corresponding passed-in parameter.

Format specification	ANSI version	Unicode version
c	CHAR	WCHAR
C	WCHAR	CHAR
hc,hC	CHAR	CHAR
hs,hS	LPSTR	LPSTR
lc,lC	WCHAR	WCHAR
ls, lS	LPWSTR	LPWSTR
s	LPSTR	LPWSTR
S	LPWSTR	LPSTR
Table 1: wsprintf() format specification.

The data type for the output text always depends on the version of the function. Where the data type of the passed-in parameter and of the output text do not agree, wsprintf() performs a conversion from Unicode to ANSI, or vice versa, as required.
For the Unicode version of wsprintf(), the format string is Unicode, as is the output text.
Like the standard sprintf(), the wsprintf() function formats and stores a series of characters and values in a buffer. Any arguments are converted and copied to the output buffer according to the corresponding format specification in the format string.
The function appends a terminating null character to the characters it writes, but the return value does not include the terminating null character in its character count. The following Table lists the wsprintf() function information.

Item	Description
Function	wsprintf().
Use	Formats and stores a series of characters and values in a buffer for Unicode.
Prototype	int wsprintf( LPTSTR lpOut, LPCTSTR lpFmt, ... );
Parameters	lpOut - [out] Pointer to a buffer to receive the formatted output. The maximum size of the buffer is 1024 bytes. lpFmt - [in] Pointer to a null-terminated string that contains the format-control specifications. In addition to ordinary ASCII characters, a format specification for each argument appears in this string. ... - [in] Specifies one or more optional arguments. The number and type of argument parameters depend on the corresponding format-control specifications in the lpFmt parameter.
Return value	If the function succeeds, the return value is the number of characters stored in the output buffer, not counting the terminating null character. If the function fails, the return value is less than the length of the expected output. To get extended error information, call GetLastError().
Include file	<windows.h>
Table 2: wsprintf() function information.

Using this function incorrectly can compromise the security of your application. The string returned in lpOut is not guaranteed to be null-terminated.
Also, avoid the %s format. It can lead to a buffer overrun. If an access violation occurs, it causes a denial of service against your application. In the worst case, an attacker can inject executable code.
Consider using one of the following alternative functions: StringCbPrintf(), StringCbPrintfEx(), StringCbVPrintf(), StringCbVPrintfEx(), StringCchPrintf(), StringCchPrintfEx(), StringCchVPrintf(), or StringCchVPrintfEx().
The format-control string contains format specifications that determine the output format for the arguments following the lpFmt parameter.
Format specifications, discussed below, always begin with a percent sign (%). If a percent sign is followed by a character that has no meaning as a format field, the character is not formatted (for example, %% produces a single percent-sign character).
The format-control string is read from left to right. When the first format specification (if any) is encountered, it causes the value of the first argument after the format-control string to be converted and copied to the output buffer according to the format specification.

The second format specification causes the second argument to be converted and copied, and so on. If there are more arguments than format specifications, the extra arguments are ignored. If there are not enough arguments for all of the format specifications, the results are undefined.
A format specification has the following form:

%[-][#][0][width][.precision]type

Each field is a single character or a number signifying a particular format option. The type characters that appear after the last optional format field determine whether the associated argument is interpreted as a character, a string, or a number.
The simplest format specification contains only the percent sign and a type character (for example, %s). The optional fields control other aspects of the formatting.
The following Table lists the optional and required fields and their meanings.

Field	Meaning
-	Pad the output with blanks or zeros to the right to fill the field width, justifying output to the left. If this field is omitted, the output is padded to the left, justifying it to the right.
#	Prefix hexadecimal values with 0x (lowercase) or 0X (uppercase).
0	Pad the output value with zeros to fill the field width. If this field is omitted, the output value is padded with blank spaces.
width	Copy the specified minimum number of characters to the output buffer. The width field is a nonnegative integer. The width specification never causes a value to be truncated; if the number of characters in the output value is greater than the specified width, or if the width field is not present, all characters of the value are printed, subject to the precision specification.
.precision	For numbers, copy the specified minimum number of digits to the output buffer. If the number of digits in the argument is less than the specified precision, the output value is padded on the left with zeros. The value is not truncated when the number of digits exceeds the specified precision. If the specified precision is 0 or omitted entirely, or if the period (.) appears without a number following it, the precision is set to 1. For strings, copy the specified maximum number of characters to the output buffer.
type	Output the corresponding argument as a character, a string, or a number. This field can be any of the following values:
c - Single character. This value is interpreted as type WCHAR if the calling application defines Unicode and as type CHAR otherwise.
C - Single character. This value is interpreted as type CHAR if the calling application defines Unicode and as type WCHAR otherwise.
d - Signed decimal integer. This value is equivalent to i.
hc, hC - Single character. The wsprintf() function ignores character arguments with a numeric value of zero. This value is always interpreted as type CHAR, even when the calling application defines Unicode.
hd - Signed short integer argument.
hs, hS - String. This value is always interpreted as type LPSTR, even when the calling application defines Unicode.
hu - Unsigned short integer.
i - Signed decimal integer. This value is equivalent to d.
lc, lC - Single character. The wsprintf() function ignores character arguments with a numeric value of zero. This value is always interpreted as type WCHAR, even when the calling application does not define Unicode.
ld - Long signed integer. This value is equivalent to li.
li - Long signed integer. This value is equivalent to ld.
ls, lS - String. This value is always interpreted as type LPWSTR, even when the calling application does not define Unicode. This value is equivalent to ws.
lu - Long unsigned integer.
lx, lX - Long unsigned hexadecimal integer in lowercase or uppercase.
p - Windows 2000/XP: Pointer. The address is printed using hexadecimal.
s - String. This value is interpreted as type LPWSTR when the calling application defines Unicode and as type LPSTR otherwise.
S - String. This value is interpreted as type LPSTR when the calling application defines Unicode and as type LPWSTR otherwise.
u - Unsigned integer argument.
x, X - Unsigned hexadecimal integer in lowercase or uppercase.
Table 3: Optional and required fields for wsprintf().

As you may have noticed, the explanation in this section is quite similar to that of other formatted output functions such as printf(). The differences are the character and string format specifications and the support for Unicode.
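
The effect of the flag, width and precision fields can be tried with a small portable sketch. Because wsprintf() exists only on Windows, the standard snprintf() function stands in for it here; this is an assumption made for portability, and only fields that the table above describes in the same way as standard printf() are used. The function name format_demo() is hypothetical.

```c
#include <stdio.h>

/* Format one integer three ways and one string with a precision.
   Standard snprintf() stands in for wsprintf() in this sketch. */
int format_demo(char *out, size_t size, int number, const char *text)
{
    /* %-6d : left-justify in a field of width 6
       %06d : zero-pad to a width of 6
       %#x  : hexadecimal with a 0x prefix
       %.3s : at most 3 characters of the string */
    return snprintf(out, size, "[%-6d][%06d][%#x][%.3s]",
                    number, number, (unsigned int)number, text);
}
```

Called with the number 255 and the text "Unicode", the buffer receives [255   ][000255][0xff][Uni]. Unlike wsprintf(), snprintf() takes the buffer size as a parameter, which is the same protection the StringCchPrintf() family offers.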

Conventions for Function Prototypes – Win32 Programming

Win32 also provides function prototypes in generic, ANSI, and Unicode versions. The generic function prototypes can be compiled to produce either ANSI or Unicode prototypes.
As an example, all three prototypes are shown in the following code sample for the SetWindowText() function.

Generic Prototype

BOOL SetWindowText(HWND hwnd, LPCTSTR lpText);

The header file provides the generic function name implemented as a macro:

#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE

The preprocessor expands the macro into either the ANSI or Unicode function names, depending on whether UNICODE is defined. The letter A (ANSI) or W (wide) is added at the end of the generic function name, as appropriate. The header file then provides ANSI and Unicode function prototypes, as shown in the following examples.

ANSI Prototype

BOOL SetWindowTextA(HWND hwnd, LPCSTR lpText);

Unicode Prototype

BOOL SetWindowTextW(HWND hwnd, LPCWSTR lpText);

Note that the generic function prototype uses the generic LPCTSTR for the text parameter, but the ANSI prototype uses LPCSTR, and the Unicode prototype uses LPCWSTR.
You can call the generic function in your application, and then define UNICODE when you compile the code to use the Unicode function.
To default to the ANSI function, do not define UNICODE. You can mix function calls by using the explicit function names ending with A and W.
This approach applies to all functions with text arguments. Always use a generic function prototype with the generic string and character types.
All function names that end with an uppercase W take wide-character arguments. Some functions exist only in wide-character versions and can be used only with the appropriate data types.
The Requirements section in the MSDN documentation for each function provides information on the function versions implemented by the system. If there is a line that begins with Unicode, the function has separate Unicode and ANSI versions.
Whenever a function has a length parameter for a character string, the length should be documented as a count of TCHAR values in the string.
This refers to bytes for ANSI versions of the function or characters for Unicode versions. However, functions that require or return pointers to untyped memory blocks, such as the GlobalAlloc() function, are exceptions.
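
The naming convention is easy to reproduce in miniature. In the sketch below, LabelLengthA(), LabelLengthW(), the generic name LabelLength and the LABEL_TEXT() macro are all hypothetical; they only imitate the way the Windows headers pair an A and a W function behind one generic name.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pair of functions following the A/W naming convention.
   Each returns the length of its argument in characters. */
size_t LabelLengthA(const char *text)    { return strlen(text); }
size_t LabelLengthW(const wchar_t *text) { return wcslen(text); }

/* Generic name: the preprocessor picks one of the two real functions. */
#ifdef UNICODE
#define LabelLength LabelLengthW
#define LABEL_TEXT(s) L##s
#else
#define LabelLength LabelLengthA
#define LABEL_TEXT(s) s
#endif

/* Code written against the generic name compiles either way. */
size_t generic_call(void)
{
    return LabelLength(LABEL_TEXT("Hello"));
}
```

The explicit A and W names remain callable regardless of whether UNICODE is defined, which is how an application can mix both kinds of calls.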

Character and Unicode strings Conversion

The MultiByteToWideChar() function can be used to map a character string to a wide-character (Unicode) string. The character string mapped by this function is not necessarily from a multibyte character set.
The information for this function is listed in the following Table.

Item	Description
Function	MultiByteToWideChar().
Use	Maps a character string to a wide-character (Unicode) string. The character string mapped by this function is not necessarily from a multibyte character set.
Prototype	int MultiByteToWideChar(
    UINT CodePage,          // code page
    DWORD dwFlags,          // character-type options
    LPCSTR lpMultiByteStr,  // string to map
    int cbMultiByte,        // number of bytes in string
    LPWSTR lpWideCharStr,   // wide-character buffer
    int cchWideChar);       // size of buffer
Parameters	CodePage - [in] Specifies the code page to be used to perform the conversion. This parameter can be given the value of any code page that is installed or available in the system. You can also specify one of the values listed below:
CP_ACP - ANSI code page.
CP_MACCP - Macintosh code page.
CP_OEMCP - OEM code page.
CP_SYMBOL - Windows 2000/XP: Symbol code page (42).
CP_THREAD_ACP - Windows 2000/XP: The current thread's ANSI code page.
CP_UTF7 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-7.
CP_UTF8 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-8.
dwFlags - [in] Indicates whether to translate to pre-composed or composite-wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to deal with invalid characters. You can specify a combination of the following flag constants:
MB_PRECOMPOSED - Always use pre-composed characters, that is, characters in which a base character and a non-spacing character have a single character value. This is the default translation option. Cannot be used with MB_COMPOSITE.
MB_COMPOSITE - Always use composite characters, that is, characters in which a base character and a non-spacing character have different character values. Cannot be used with MB_PRECOMPOSED.
MB_ERR_INVALID_CHARS - If the function encounters an invalid input character, it fails and GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.
MB_USEGLYPHCHARS - Use glyph characters instead of control characters.
A composite character consists of a base character and a non-spacing character, each having different character values. A pre-composed character has a single character value for a base/non-spacing character combination. In the character è, the e is the base character and the accent grave mark is the non-spacing character. The function's default behavior is to translate to the pre-composed form. If a pre-composed form does not exist, the function attempts to translate to a composite form. The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive. The MB_USEGLYPHCHARS flag and the MB_ERR_INVALID_CHARS flag can be set regardless of the state of the other flags.
For the following code pages, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS: 50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF7), 65001 (UTF8), 42 (Symbol). Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by code page 65001 (UTF-8).
lpMultiByteStr - [in] Points to the character string to be converted.
cbMultiByte - [in] Specifies the size in bytes of the string pointed to by the lpMultiByteStr parameter, or it can be -1 if the string is null terminated. Note that if cbMultiByte is 0, the function fails. If this parameter is -1, the function processes the entire input string including the null terminator. The resulting wide character string therefore has a null terminator, and the returned length includes the null terminator. If this parameter is a positive integer, the function processes exactly the specified number of bytes. If the given length does not include a null terminator, then the resulting wide character string will not be null terminated, and the returned length does not include a null terminator.
lpWideCharStr - [out] Points to a buffer that receives the translated string.
cchWideChar - [in] Specifies the size, in wide characters, of the buffer pointed to by the lpWideCharStr parameter. If this value is zero, the function returns the required buffer size, in wide characters, and makes no use of the lpWideCharStr buffer.
Return value	If the function succeeds, and cchWideChar is nonzero, the return value is the number of wide characters written to the buffer pointed to by lpWideCharStr. If the function succeeds, and cchWideChar is zero, the return value is the required size, in wide characters, for a buffer that can receive the translated string. If dwFlags equals zero, the input string is UTF-8 and it contains invalid characters, the function returns ERROR_NO_UNICODE_TRANSLATION. If the function fails, the return value is zero. To get extended error information, call GetLastError(). GetLastError() may return one of the following error codes: ERROR_INSUFFICIENT_BUFFER, ERROR_INVALID_FLAGS, ERROR_INVALID_PARAMETER, ERROR_NO_UNICODE_TRANSLATION.
Include file	<windows.h>
Table 4: MultiByteToWideChar() information.

Using the MultiByteToWideChar() function incorrectly can compromise the security of your application. Calling the MultiByteToWideChar() function can easily cause a buffer overrun because the size of the input buffer is given as a number of bytes, while the size of the output buffer is given as a number of WCHARs.
To avoid a buffer overrun, be sure to specify a buffer size appropriate for the data type the buffer receives.
The lpMultiByteStr and lpWideCharStr pointers must not be the same. If they are the same, the function fails, and GetLastError() returns the value ERROR_INVALID_PARAMETER.
The function fails if MB_ERR_INVALID_CHARS is set and it encounters an invalid character in the source string. An invalid character is either:

A character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or

For DBCS strings, a character that has a lead byte but no valid trailing byte.

When an invalid character is found and MB_ERR_INVALID_CHARS is set, the function returns 0 and GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.
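
The sizing rules for cbMultiByte and cchWideChar are easier to follow in code. The function below is not MultiByteToWideChar(); it is a hypothetical, portable stand-in named widen_latin1() that follows the same length conventions for ISO 8859-1 input only, where each byte value equals its Unicode code point.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in that mirrors the sizing rules described for
   MultiByteToWideChar(), for ISO 8859-1 input only.
   cb  == -1 : src is null-terminated; the terminator is converted too.
   cch ==  0 : nothing is written; the required size is returned.
   Returns 0 on failure. */
int widen_latin1(const char *src, int cb, wchar_t *dst, int cch)
{
    int i;

    if (src == NULL || cb == 0 || cb < -1 || cch < 0)
        return 0;
    if (cb == -1)
        cb = (int)strlen(src) + 1;      /* include the terminator    */
    if (cch == 0)
        return cb;                      /* size query only           */
    if (dst == NULL || cch < cb)
        return 0;                       /* output buffer too small   */
    for (i = 0; i < cb; ++i)
        dst[i] = (wchar_t)(unsigned char)src[i];
    return cb;                          /* wide characters written   */
}
```

A typical caller makes two calls: the first with a zero output size to learn how many wide characters are needed, and the second with a buffer of that many WCHARs, which is that count multiplied by sizeof(WCHAR) in bytes. Confusing the two units is the overrun risk mentioned above.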

The WideCharToMultiByte() function can be used to map a wide-character string to a new character string. The new character string is not necessarily from a multibyte character set.
The following Table lists the information for WideCharToMultiByte().

Item	Description
Function	WideCharToMultiByte().
Use	Maps a wide-character string to a new character string. The new character string is not necessarily from a multibyte character set.
Prototype	int WideCharToMultiByte( UINT CodePage, // code page DWORD dwFlags, // performance and mapping flags LPCWSTR lpWideCharStr, // wide-character string int cchWideChar, // number of chars in string. LPSTR lpMultiByteStr, // buffer for new string int cbMultiByte, // size of buffer LPCSTR lpDefaultChar, // default for unmappable chars LPBOOL lpUsedDefaultChar); // set when default char used
Parameters	CodePage - [in] Specifies the code page used to perform the conversion. This parameter can be given the value of any code page that is installed or available in the system. For a list of code pages, check MSDN documentation for Code Page Identifiers provided at the end of this Module. You can also specify one of the following values: CP_ACP - ANSI code page. CP_MACCP - Macintosh code page. CP_OEMCP - OEM code page. CP_SYMBOL - Windows 2000/XP: Symbol code page (42). CP_THREAD_ACP - Windows 2000/XP: Current thread's ANSI code page. CP_UTF7 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-7. When this is set,lpDefaultChar and lpUsedDefaultChar must be NULL. CP_UTF8 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-8. When this is set,dwFlags must be zero and bothlpDefaultChar and lpUsedDefaultChar must be NULL. dwFlags - [in] Specifies the handling of unmapped characters. The function performs more quickly when none of these flags is set. The following flag constants are defined. WC_NO_BEST_FIT_CHARS - Windows 98/Me and Windows 2000/XP: Any Unicode characters that do not translate directly to multibyte equivalents are translated to the default character (seelpDefaultChar parameter). In other words, if translating from Unicode to multibyte and back to Unicode again does not yield the exact same Unicode character, the default character is used. This flag can be used by itself or in combination with the otherdwFlag options. WC_COMPOSITECHECK - Convert composite characters to pre-composed characters. WC_DISCARDNS - Discard non-spacing characters during conversion. WC_SEPCHARS - Generate separate characters during conversion. This is the default conversion behavior. WC_DEFAULTCHAR - Replace exceptions with the default character during conversion. WhenWC_COMPOSITECHECK is specified, the function converts composite characters to pre-composed characters. 
A composite character consists of a base character and a non-spacing character, each having different character values. A pre-composed character has a single character value for a base/non-spacing character combination. In the character, the e is the base character, and the accent grave mark is the non-spacing character. When an application specifiesWC_COMPOSITECHECK, it can use the last three flags in this list (WC_DISCARDNS,WC_SEPCHARS, and WC_DEFAULTCHAR) to customize the conversion to pre-composed characters. These flags determine the function's behavior when there is no pre-composed mapping for a base/non-space character combination in a wide-character string. These last three flags can only be used if theWC_COMPOSITECHECK flag is set. The function's default behavior is to generate separate characters (WC_SEPCHARS) for unmapped composite characters. For the following code pages,dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS. 50220 50227 57002 through 57011 50221 50229 65000 (UTF7) 50222 52936 65001 (UTF8) 50225 54936 42 (Symbol) lpWideCharStr - [in] Points to the wide-character string to be converted. cchWideChar - [in] Specifies the number of wide characters in the string pointed to by thelpWideCharStr parameter. If this value is -1, the string is assumed to be null-terminated and the length is calculated automatically. The length will include the null-terminator. Note that ifcchWideChar is zero the function fails. lpMultiByteStr - [out] Points to the buffer to receive the translated string. cbMultiByte - [in] Specifies the size, in bytes, of the buffer pointed to by thelpMultiByteStr parameter. If this value is zero, the function returns the number of bytes required for the buffer. In this case, thelpMultiByteStr buffer is not used. lpDefaultChar - [in] Points to the character used if a wide character cannot be represented in the specified code page. If this parameter isNULL, a system default value is used. 
The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL. For the code pages listed under dwFlags, lpDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER.
lpUsedDefaultChar - [out] Points to a flag that indicates whether a default character was used. The flag is set to TRUE if one or more wide characters in the source string cannot be represented in the specified code page. Otherwise, the flag is set to FALSE. This parameter may be NULL. The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL. For the code pages listed under dwFlags, lpUsedDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER.
Return value	If the function succeeds, and cbMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr. The number includes the byte for the null terminator. If the function succeeds, and cbMultiByte is zero, the return value is the required size, in bytes, for a buffer that can receive the translated string. If the function fails, the return value is zero. To get extended error information, call GetLastError(). GetLastError() may return one of the following error codes: ERROR_INSUFFICIENT_BUFFER, ERROR_INVALID_FLAGS, ERROR_INVALID_PARAMETER.
Include file	<windows.h>
Table 5: WideCharToMultiByte() information.

Using the WideCharToMultiByte() function incorrectly can compromise the security of your application. Calling the WideCharToMultiByte() function can easily cause a buffer overrun because the size of the input buffer (cchWideChar) is counted in WCHARs, while the size of the output buffer (cbMultiByte) is counted in bytes.
To avoid a buffer overrun, be sure to specify a buffer size appropriate for the data type the buffer receives. By the way, Visual C++ .Net provides a buffer security check (the /GS compiler option) that makes the compiler insert run-time overrun checks into the generated code.
For strings that require validation, such as file, resource and user names, always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte(). This flag prevents the function from mapping characters to characters that appear similar but have very different semantics.
In some cases, the semantic change can be extreme. For example, the symbol ‘∞’ (infinity) maps to the digit 8 (eight) in some code pages.
WC_NO_BEST_FIT_CHARS is not available on Windows 95 and NT4. If your code must run on these platforms, you can achieve the same effect by round tripping the string using MultiByteToWideChar(). Any code point that does not round trip is a best-fit character.
The lpMultiByteStr and lpWideCharStr pointers must not be the same. If they are the same, the function fails, and GetLastError() returns ERROR_INVALID_PARAMETER.
If CodePage is CP_SYMBOL and cbMultiByte is less than cchWideChar, no characters are written to lpMultiByteStr. Otherwise, if cbMultiByte is less than cchWideChar, cbMultiByte characters are copied to the buffer pointed to by lpMultiByteStr.
An application can use the lpDefaultChar parameter to change the default character used for the conversion.
As noted earlier, the WideCharToMultiByte() function operates most efficiently when both lpDefaultChar and lpUsedDefaultChar are NULL. The following table shows the behavior of WideCharToMultiByte() for the four combinations of lpDefaultChar and lpUsedDefaultChar.

lpDefaultChar	lpUsedDefaultChar	Result
NULL	NULL	No default checking. This is the most efficient way to use this function.
non-NULL	NULL	Uses the specified default character, but does not set lpUsedDefaultChar.
NULL	non-NULL	Uses the system default character and sets lpUsedDefaultChar if necessary.
non-NULL	non-NULL	Uses the specified default character and sets lpUsedDefaultChar if necessary.
Table 6: lpDefaultChar and lpUsedDefaultChar combination behaviors.

Further reading and digging:

Check the best selling C, C++ and Windows books at Amazon.com.
Microsoft C references, online MSDN.
Microsoft Visual C++, online MSDN.
ReactOS - Windows binary compatible OS - C/C++ source code repository, Doxygen.
Structure, enum, union and typedef story can be found at C/C++ struct, enum, union & typedef.
Linux Access Control Lists (ACL) info can be found at Access Control Lists.
For Unicode and character set reference that contains functions, structures, macros and constants: Unicode and character set reference (MSDN).
Notation used in MSDN is Hungarian Notation instead of CamelCase and is discussed at Windows programming notations.
Windows data type information is in Windows data types used in Win32 programming.
